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Preface 


MOTIVATION FOR THE BOOK 

The recent resurgence of interest in neural networks has its roots in the 
recognition that the brain performs computations in a different manner than do 
conventional digital computers. Computers are extremely fast and precise at ex- 
ecuting sequences of instructions that have been formulated for them. A human 
information processing system is composed of neurons switching at speeds about 
a million times slower than computer gates. Yet, humans are more efficient 
than computers at computationally complex tasks such as speech understanding. 
Moreover, not only humans, but even animals, can process visual information 
better than the fastest computers. 

The question of whether technology can benefit from emulating the computa- 
tional capabilities of organisms is a natural one. Unfortunately, the understanding 
of biological neural systems is not developed enough to address the issues of 
functional similarity that may exist between the biological and man-made neural 
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systems. As a result, any major potential gains derived from such functional 
similarity, if they exist, have yet to be exploited. 

This book introduces the foundations of artificial neural systems. Much of 
the inspiration for such systems comes from neuroscience. However, we are not 
directly concemed with networks of biological neurons in this text. Although 
the newly developed paradigms of artificial neural networks have strongly con- 
tributed to the discovery, understanding, and utilization of potential functional 
similarities between human and artificial information processing systems, many 
questions remain open. Intense research interest persists and the area continues 
to develop. The ultimate research objective is the theory and implementation of 
massively parallel interconnected systems which could process the information 
with an efficiency comparable to that of the brain. 

To achieve this research objective, we need to define the focus of the study of 
artificial neural systems. Artificial neural systems , or neural networks , are physi- 
cal cellular systems which can acquire, store , and utilize experiential knowledge. 
The knowledge is in the form of stable States or mappings embedded in networks 
that can be recalled in response to the presentation of cues. This book focuses on 
the foundations of such networks. The fundamentals of artificial neural systems 
theory, algorithms for information acquisition and retrieval, examples of appli- 
cations, and implementation issues are included in our discussion. We explore a 
rather new and fundamentally different approach to computing and information 
processing. 

Programmed computing, which has dominated information processing for 
more than four decades, is based on decision rules and algorithms encoded into 
the form of computer programs. The algorithms and program-controlled com- 
puting necessary to operate conventional computers have their counterparts in 
the leaming rules and information recall procedures of a neural network. These 
are not exact counterparts, however, because neural networks go beyond digital 
computers since they can progressively alter their processing structure in response 
to the information they receive. 

The purpose of this book is to help the reader understand the acquisition and 
retrieval of experiential knowledge in densely interconnected networks containing 
cells of processing elements and interconnecting links. Some of the discussed net- 
works can be considered adaptive; others acquire knowledge a priori , Retrieval 
of knowledge is termed by some authors as neural computation. However, “neu- 
ral computation” is only a segment of the artificial neural systems focus. This 
book also addresses the concepts of parallel machines that are able to acquire 
knowledge, and the corresponding issues of implementation. 

I believe that the field of artificial neural systems has evolved to a point 
where a course on the subject is justified. Unfortunately, while the technical 
literature is full of reports on artificial neural systems theories and applications, 
it is hard to find a complete and unified description of techniques. The begin- 
ning student is likely to be bewildered by different levels of presentations, and 
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a widespread spectrum of metaphors and approaches presented by authors of 
diverse backgrounds. 

This book was conceived to satisfy the need for a comprehensive and unified 
text in this new discipline of artificial neural systems. It brings students a fresh 
and fascinating perspective on computation. The presentation of the material fo- 
cuses on basic system concepts and involves leaming algorithms, architectures, 
applications, and implementations. The text grew out of the teaching elfort in 
artificial neural systems offered for electrical engineering and computer science 
and engineering senior and graduate students at the University of Louisville. 
In addition, the opportunity to work at Princeton University has considerably 
influenced the project. 

This book is designed for a one semester course. It is written at a com- 
prehensible level for students who have had calculus, linear algebra, analytical 
geometry, differential equations, and some exposure to optimization theory. As 
such, the book is suitable for senior-level undergraduate or beginning graduate 
students. Due to the emphasis on systems, the book should be appropriate for 
electrical, computer, industrial, mechanical, and manufacturing engineering stu- 
dents as well as for computer and information science, physics, and mathematics 
students. Those whose major is not electrical or computer engineering may find 
Sections 9.2 to 9.5 superfluous for their study. They can also skip Sections 5.4 
to 5.6 without any loss of continuity. 

With the mathematical and programming references in the Appendix, the 
book is self-contained. It should also serve already-practicing engineers and sci- 
entists who intend to study the neural networks area. In particular, it is assumed 
that the reader has no experience in neural networks, leaming machines, or pattem 
recognition. 


PHILOSOPHY 

The fundamentals of artificial neural systems are the main concem of this 
book. Concepts are introduced gradually. The basic neural network paradigms 
are conveyed using a system perspective and mathematics exposition suitable for 
thorough understanding. Although only some of the algorithms are provided with 
proofs, the book uses mathematical exposition at the depth essential for artificial 
neural systems implementation and simulation, and later, initiation of research in 
the area. The reader is not only shown how to get a result, but also how to think 
about and possibly extend it. 

Neural network processing typically involves dealing with large-scale prob- 
lems in terms of dimensionality, amount of data håndled, and the volume of 
simulation or neural hardware processing. This large-scale approach is both essen- 
tial and typical for real-life applications. Although the study of such applications 
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show dramatically what has been done, a student needs to find out how and why 
it has happened. Peeling off the complexity of large-scale approaches, diversity 
of applications, and overlapping aspects can facilitate the study. 

When serious readers look behind the easy articles on neural networks in 
popular books or magazines, they may notice an involved mosaic of theories 
taken from mathematics, physics, neurobiology, and engineering. Fortunately, 
there are unified and pedagogical approaches to studying such complex subjects 
as neural networks. In order to understand digital computer organization and 
architecture, for example, one needs to study basic axioms of Boolean algebra, 
binary arithmetic, sequential circuits, control organization, microprogramming, 
etc. A similar decomposition of the neural network knowledge domain into basic 
but coherent parts should simplify the study. Such an approach is employed in 
this text. 

An additional difficulty in studying the field arises from the large num- 
ber of publications appearing in the neural network field. They make use of a 
variety of mathematical approaches, notations, and terminologies. Authors with 
diverse backgrounds bring a myriad of perspectives from numerous disciplines, 
ranging from neurobiology and physics to engineering and computer science. 
While the field has benefitted from having experts with a variety of backgrounds 
and perspectives, the resulting differences in approach have made it difficult to 
see the underlying similarities of substance. To blend interdisciplinary aspects of 
the discipline, I present a unified perspective, and also link the approaches and 
terminologies among them. 

As the reader will hopefully notice, the fundamental concepts of neural pro- 
cessing are often rather lucid. Because they combine simplicity and power, they 
are also appealing. The approach often taken in this text is to reduce both high 
dimensionality and data volume whenever possible without loss of generality, and 
to demonstrate algorithms in two- or three-dimensional Space and with a small 
amount of data. An algorithm is often introduced along with an explanatory 
example showing how and why it works. Later on, the algorithm is formally 
outlined and formulated for any size of the problem. Some proofs of algorithms 
are included; others are omitted for simplicity. Some proofs of neural algorithms, 
however, have yet to be constructed. 

Examples within chapters usually follow this philosophy of material pre- 
sentation. I feel strongly that illustrative examples need to be made available to 
readers to develop intuitive understanding and to facilitate study. The end-of- 
chapter problems typically also have a reduced size and complexity. They focus 
on enhancing the understanding of principles; some of them are short proofs. 
I also believe that solving problems should enhance the understanding of algo- 
rithm principles and related concepts. Since the use of existing algorithms does 
not contribute much to the understanding of the subject, only a limited number 
of end of chapter problems is devoted to the “simulate and see” approach. 
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At the same time, I realize that the practical significance of neural compu- 
tation becomes apparent for large or very large-scale tasks. To illustrate neural 
networks practical applications, computationally intensive problems are also in- 
cluded at the ends of most chapters. Such problems are highlighted with asterisks 
and can be considered mini-projects. Their size is usually between the “pencil 
and paper” problem size and the large-scale application. Because some of these 
problems are rather computationally intensive, the coding of algorithms or the use 
of the ANS program (available with this text) or other software is recommended 
for their solution. Unless software is available, the mini-project problems usually 
take much more time to solve than other problems. 

Problems marked with asterisks also tend to overlap somewhat throughout 
the chapters with regard to input data, training sets, and similar aspects, but the 
algorithms used differ among problems in various chapters. As the course has 
been developed, students should have created an integrated program package for 
selected neural algorithms by the end of a semester. The input of the package 
includes user-specified problem size, choice of an algorithm, entering training 
data, training the network, and commands for running the program for test data 
in the recall mode. Obviously, neither the package code nor the applications of 
the algorithms are optimized, but they are not meant to be, since they address a 
Spectrum of neurocomputing paradigms for project work. 

Rather than focusing on a number of algorithms, another alternative for 
project content of the course is proposed. Students can be assigned one large- 
scale project, which can be the outgrowth, or fusion, of the computationally 
intensive end-of-chapter problems. In this mode, instead of using the Spectrum 
of several algorithms, one neural processing algorithm is covered in depth. Alter- 
natively, one practical application problem can be studied and approached using 
more than one paradigm. Study of the literature referenced at the end of each 
chapter, or extension of applications covered in Chapter 8, can provide the reader 
with more guidelines for further work. 


SUMMARY BY CHAPTERS 

Chapter 1 presents a number of neural network examples and applications. 
It provides an overview of what neural networks can accomplish. Brief historical 
remarks of artificial neural network development are presented, followed by a 
future outlook of the discipline. 

Chapter 2 introduces the terminology and discusses the most important 
neural processing paradigms, network properties, and leaming concepts. Basic 
assumptions, definitions, and models of neurons and neural architectures are in- 
troduced in this chapter. The chapter concludes with the analysis of networks 
and their leaming mies. 
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The foundations of supervised leaming machines are presented in Chap- 
ter 3. Feedforward single-layer architectures are studied. The concept of the 
nonparametric training of a linear machine composed of discrete perceptrons is 
introduced. The training is discussed in geometric terms, and the steepest descent 
minimization of the classification error function is accomplished for a network 
with continuous perceptrons. 

The concept of error function minimization is extended in Chapter 4 to 
multilayer feedforward neural networks. Emphasis is placed on the pattem-to- 
image space mapping provided through the first network layer. The generalized 
delta leaming rule is derived based on the steepest descent minimization rule. 
The error back-propagation algorithm is discussed and illustrated with examples. 
Application examples of multilayer feedforward networks are also covered. The 
chapter ends with presentation of functional link networks. 

Chapter 5 covers the foundations of recurrent networks. Discrete-time and 
continuous-time gradient-type networks are introduced. The time-domain behav- 
ior of single-layer feedback networks is emphasized. Particular attention is paid 
to the energy minimization property during the evolution of the system in time, 
and to the use and interpretation of trajectories in the output space. A comprehen- 
sive example involving analog-to-digital converters and an optimization problem 
solving using gradient-type networks are presented to illustrate the theory. 

Associative memories are introduced in Chapter 6. The exposition proceeds 
from the linear associator to single-layer recurrent networks. Two-layer recurrent 
networks, known as bidirectional associative memories, are introduced. They 
are then generalized to perform as multidirectional and spatio-temporal memo- 
ries. Performance evaluation of single- and double-layer memory is covered and 
illustrated with examples. Memory capacity and retrieval performance versus 
distances between stored pattems are discussed. 

Chapter 7 covers networks that for the most part leam in an unsupervised 
environment. These networks typically perform input vector matching, cluster- 
ing or feature extraction. Diverse architectures covered in this chapter include 
the Hamming networks, MAXNET, clustering and counterpropagation networks, 
self-organizing feature mapping networks, and adaptive resonance networks. 

Chapter 8 provides examples of neural network applications for computa- 
tion, control, and information processing. An optimization network for a linear 
programming task is discussed. Character recognition architectures using hidden- 
neuron layers are described, and applications in robotics and control systems are 
reviewed. Other application examples treat the implementation of connectionist 
expert systems for medical diagnosis. Such systems first acquire and then use 
the kno wledge base to diagnose a patient’s disease when provided with a list 
of symptoms. The application of a self-organizing feature mapping network for 
semantic map processing is covered. 

The final chapter, Chapter 9, is devoted to hardware implementation issues. 
A brief review of digital neurocomputing and simulation issues is followed by an 
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extensive discussion of microelectronic implementation techniques. The emphasis 
of this chapter is on adaptive analog and leaming Circuit configurations which 
can be implemented in MOS technology. 

The Appendix contains essential mathematical references needed for the un- 
derstanding of neural network paradigms covered in the book. It also contains the 
listing of the main procedures of the program ANS in Pascal, which is capable 
of running on IBM-PC compatible microcomputers. The program implements 
most computational algorithms discussed in the book, and can provide numerical 
solutions to a number of examples and problems contained in the text. 
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Notation and Abbreviations 


ANS 

ART 

BAM 

C 

CPS 

d 

D 

det A 

e 

E 

E 


EBPTA 


/(•) 

f(net) 

f (/) 


/sat+ ’fsa 


F 

g(x) 

G 

h(x) 

H 

HD 


H( w,x) 

W 


I 

I 

/+ 

/_ 

I* 

J 

J 

k 

k! 

K 

KCL 

L 


— artificial neural systems simulation program from appendix page A9 

— adaptive resonance theory 

— bidirectional associative memory 

— capacitance matrix of Hopfield network (diagonal) 

— connections per second 

— desired output vector of a trained network 

— digital control word 

— determinant of matrix A 

— error 

— error function to be minimized during leaming (for feedforward 
networks) 

— energy function to be minimized during recall (for feedback 
networks) 

— ( E can be termed as a loss, or an objective function) 

— error back-propagation training algorithm 

— neuron’s activation operator 

— activation function 

— memory forced response vector 

— operational amplifier saturation voltages 

— memory forced response matrix 

— discriminant function 

— conductance matrix of Hopfield network (diagonal) 

— multivariable function to be approximated 

— Hessian matrix (also V 2 ) 

— Hamming Distance 

— neural network-produced approximation of function h(x) 

— decision function 

— network input current vector (bias vector) 

— number of input nodes of multilayer feedforward network 

— identity matrix 

— excitatory current 

— inhibitory current 

— drain-to-source current 

— number of hidden nodes of multilayer feedforward network 

— Jacobian matrix 

— device transconductance of MOS transistor 

— process transconductance of MOS transistor 

— number of output nodes of multilayer feedforward network 

— Kirchhoff ’s current law 

— channel length of MOS transistor 
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MDAC — multiplying digital-to-analog converter 

n — normal vector to hyperplane 

net — activation vector for neuron layer (general notation) 

net — activation value of a neuron 

o — output vector of a neuron layer 

P — number of pattems in the training set 

r — leaming signal 

r — unit normal vector to hyperplane 

RAMSRA — recurrent associative memory storage and retrieval algorithm 
RDPTA — Æ-category discrete perceptron training algorithm 
R ds — tunable weight resistance produced by MOS transistor 

r(t) — reference signal (Chapter 8) 

sgn(x) — signum function 

— vector stored in memory (or classifier) 

S — vectors stored in memory, arranged in matrix 

SCPTA — single continuous perceptron training algorithm 
SDPTA — single discrete perceptron training algorithm 
T — neuron (or perceptron) threshold 

TLU — threshold logic unit 

u — activation vector (only for Hopfield layer) 

v — output vector of continuous-time feedback or memory network 

V — weight matrix (v,y is the weight from node j toward node i) 

v ds , V gs — drain-to-source and gate-to-source voltage of MOS transistor, 
respectively 

v 0 — output voltage 

V lh , V dep — threshold, depletion voltages of MOS transistor, respectively 

VTC — voltage transfer characteristics 

w — weight vector (or augmented weight vector) 

W — channel width of MOS transistor 

W — weight matrix (w,y is the weight from node j toward node i) 

w' — updated weight vector (or augmented weight vector) 

w — normalized weight vector 

x — input pattem vector 

y — input vector, or augmented input vector, also output of hidden 

layer neuron 

y k — matching score at input of MAXNET 

z — input vector for multilayer network 

Greek 

a — leaming constant 

J3 — leaming constant 


*— — C: o Vd 


xxiv 


PREFACE 


r[*] — nonlinear diagonal matrix operator with operators /(*) on the 

diagonal 

— error signal vector 

— error signal vector of hidden layer 

, — error signal vector of output layer 

; — Kronecker delta function 

— leaming constant 

— memory output noise vector 

* #2 — joint angles of planar robot manipulator 

— gain (steepness) factor of the continuous activation function, also 
used in Chapter 9 to denote the channel length modulation factor 

p — vigilance test level (Chapter 7) (also used as measure of distance 

in Chapter 2, and as radius of attraction Chapter 6) 

\E(x) — gradient of a multivariable function E(x) with respect to x 

V x 2 £(x) — Hessian matrix of a multivariable function E(x) with respect to x 



ARTIFICIAL NEURAL 
SYSTEMS: PRELIMINARIES 

Happy those who are convinced so 
as to be of the general opinions . 

LORD HALIFAX 

1.1 Neural Computation: Some Examples and Applications 

1.2 History of Artificial Neural Systems Development 

1.3 Future Outlook 


F or many centuries, one of the goals of humankind has been to develop 
machines. We envisioned these machines as performing all cumbersome 
and tedious tasks so that we might enjoy a more fruitful life. The era of machine 
making began with the discovery of simple machines such as lever, wheel and 
pulley. Many equally congenial inventions followed thereafter. Nowadays engi- 
neers and scientists are trying to develop intelligent machines. Artificial neural 
systems are present-day examples of such machines that have great potential to 
further improve the quality of our life. 

As mentioned in the preface, people and animals are much better and faster 
at recognizing images than most advanced computers. Although computers out- 
perform both biological and artificial neural systems for tasks based on precise 
and fast arithmetic operations, artificial neural systems represent the promising 
new generation of information processing networks. Advances have been made 
in applying such systems for problems found intractable or difficult for traditional 
computation. Neural networks can supplement the enormous processing power 
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of the von Neumann digital computer with the ability to make sensible decisions 
and to leam by ordinary experience, as we do. 

A neural network’s ability to perform computations is based on the hope that 
we can reproduce some of the flexibility and power of the human brain by arti- 
ficial means. NetWork computation is perf ormed by a dense mesh of computing 
nodes and connections. They operate collectively and simultaneously on most or 
all data and inputs. The basic processing elements of neural networks are called 
artificial neurons, or simply neurons. Often we simply call them nodes. Neurons 
perform as summing and nonlinear mapping junctions. In some cases they can 
be considered as threshold units that fire when their total input exceeds certain 
bias levels. Neurons usually operate in parallel and are configured in regular 
architectures. They are often organized in layers, and feedback connections both 
within the layer and toward adjacent layers are allowed. Each connection strength 
is expressed by a numerical value called a weight, which can be modified. 

We have just introduced several new words describing some of the attributes 
of neural processing. These, and other notions introduced bter in this chapter, 
are not yet rigorously defined. For now we will use them to discuss qualitatively 
several introductory examples and applications of neurocomputing. A more de- 
tailed exposition of concepts and terminology is provided in Chapter 2. Let us 
only mention that the artificial neural systems field alone goes under the guise of 
many names in the literature. The effort is not only called neural networks, but 
also neurocomputing, network computation, connectionism, parallel distributed 
processing, layered adaptive systems, self-organizing networks, or neuromorphic 
systems or networks. The reader may not need to assimilate all new technical 
jargon terms related to neurocomputing. Rather, the variety of names indicates 
how many different perspectives can be taken when studying a single subject of 
interest such as neural networks. 

Artificial neural systems function as parallel distributed computing networks. 
Their most basic characteristic is their architecture. Only some of the networks 
provide instantaneous responses. Other networks need time to respond and are 
characterized by their time-domain behavior, which we often refer to as dynamics. 
Neural networks also differ from each other in their leaming modes. There are 
a variety of leaming rules that establish when and how the connecting weights 
change. Finally, networks exhibit different speeds and efficiency of leaming. As a 
result, they also differ in their ability to accurately respond to the cues presented 
at the input. 

In contrast to conventional computers, which are programmed to perform 
specific tasks, most neural networks must be taught, or trained. They leam new 
associations, new pattems, and new functional dependencies. Leaming corre- 
sponds to parameter changes. As will be shown in subsequent chapters, leaming 
mies and algorithms used for experiential training of networks replace the pro- 
gramming required for conventional computation. Neural network users do not 
specify an algorithm to be executed by each computing node as would program- 
mers of a more traditional machine. Instead, they select what in their view is the 


1.1 NEURAL COMPUTATION: SOME EXAMPLES AND APPLICATIONS 


3 


best architecture, specify the characteristics of the neurons and initial weights, and 
choose the training mode for the network. Appropriate inputs are then applied 
to the network so that it can acquire knowledge from the environment. As a 
result of such exposure, the network assimilates the information that can later be 
recalled by the user. 

Artificial neural system computation lies in the middle ground between en- 
gineering and artificial intelligence. The mathematical techniques utilized, such 
as mean-square error minimization, are engineering-like, but an experimental ad 
hoc approach is often necessary. Heuristic methodology and “quasi-rigorous” 
techniques of network leaming are needed, since often no theory is available for 
selection of the appropriate neural system for a specific application. 

Much of this book is devoted to answering questions, such as the folio wing: 
How can a network be trained efficiently and why does it leam? What models of 
neurons need to be used for good performance? What are the best architectures for 
certain classes of problems? What are the best ways of extracting the knowledge 
stored in a network? What are the typical applications of neural computation and 
how accurate are they? How can a specific task be accomplished if it is modeled 
through a neural network? How can artificial neural systems be implemented 
as application-specific integrated circuits? We realize that our main objective of 
study is to design efficient neural systems. However, a strong understanding of 
the principles and behavior of networks is invaluable if a good design is to 
be developed. 

Neural networks have attracted the attention of scientists and technologists 
from a number of disciplines. The discipline has attracted neuroscientists who are 
interested in modeling biological neural networks, and physicists who envisage 
analogies between neural network models and the nonlinear dynamical systems 
they study. Mathematicians are fascinated by the potential of mathematical mod- 
eling applied to complex large systems phenomena. Electrical and computer 
engineers look at artificial neural systems as computing networks for signal pro- 
cessing. They are also interested in budding electronic integrated circuit-based 
intelligent machines. Psychologists look at artificial neural networks as possi- 
ble prototype structures of human-like information processing. Finally, computer 
scientists are interested in opportunities that are opened by the massively paral- 
lel computational networks in the areas of artificial intelligence, computational 
theory, modeling and simulation, and others. 


1.1 _________ 

NEURAL COMPUTATION: 

SOME EXAMPLES 
AND APPLICATIONS 

In the previous section we read about interesting and novel aspects of neural 
computation. We now focus on the applicability and usefulness of neural network 
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technology in the solving of real-world problems. Below, we demonstrate sam- 
ple applications and relate them to simplified neural network examples. Because 
mathematical exposition and more precise explanation of concepts are addressed 
starting in Chapter 2, this discussion is descriptive rather than quantitative. 


Classifiers, Approximators, 
and Autonomous Drivers 


This portion of our neural networks discussion introduces networks that 
respond instantaneously to the applied input. Let us try to inspect the perfor- 
mance of a simple classifier, and then also of similar networks that are far more 
complex, but can be designed as extensions of the classifier. 

Assume that a set of eight points, P 0 , P lf ..., P 7 , in three-dimensional 
space is available. The set consists of all vertices of a three-dimensional cube as 
follows: 

{P 0 (-l, -1, -1),P,(-1, -1, 1),P 2 (-1, 1, -1),P 3 (-1, 1. 1). 

P 4 (l, -1, -1),P 5 (1, -1, 1),P 6 (1, 1, -1),P 7 (1, 1, 1)} 


Elements of this set need to be classified into two categories. The first 
category is defined as containing points with two or more positive ones; the 
second category contains all the remaining points that do not belong to the first 
category. Accordingly, points P 3 , P 5 , P 6 , and P 7 belong to the first category, and 
the remaining points to the second category. 

Classification of points P 3 , P 5 , P 6 , and P 7 can be based on the summation 
of coordinate values for each point evaluated for category membership. Notice 
that for each point P, (x,, jc 2 , x 3 ), where i = 0, .... 7, the membership in the 
category can be established by the following calculation: 


If sgn (xj + x 2 + 



1 , 

- 1 , 


then category 1 
then category 2 


( 1 . 1 ) 


Expression (1.1) describes the decision function of the classifier designed 
by inspection of the set that needs to be partitioned. No design formulas or train- 
ing have been required to accomplish the design, however. The resulting neural 
network shown in Figure 1.1 (a) is extremely simple. It implements expression 
(1.1) by summing the inputs x x , jc 2 , and x 3 with unity weighting coefficients. The 
weighted sum undergoes a thresholding operation and the output of the unit is 1 
for x x + x 2 + x 3 > 0, otherwise it is —1. We thus achieve the desired classifi- 
cation using a single unit, or computing node. The unit implements summation 
with respective weights followed by the thresholding operation. 

Looking at the network just discussed and taking a different perspective, 
we may notice that in addition to classification, the network performs three- 
dimensional Cartesian space partitioning as illustrated in Figure 1.1 (b). The 
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(a) 

Figure 1.1 Partitioning of the set 
partitioning. 



cube vertices: (a) single-unit diagram and (b) implemented 


partitioning plane is x x + x 2 + * 3 = 0. All points above it are mapped into + 1 
and all below it are mapped into -1. Although this observation is not crucial 
for our understanding of classifier design, insight into the geometry of mapping 
will prove to be very valuable for more complex cases. 

Design of neural network classifiers becomes far more involved and intrigu- 
ing when requirements for membership in categories become complicated. No 
single-unit classifier exists, for example, that would implement assignment of P 2 , 
P 3 , and P 5 into a single category, and of the remaining five points of the set into 
the other category. The details of the classification discussion, however, will be 
postponed until Chapter 3, since they require more formal coverage. 

As stated, the unit from Figure 1.1 (a) maps the entire three-dimensional 
space into just two points, 1 and —1. A question arises as to whether a unit 
with a “squashed” sgn function rather than a regular sgn function could prove 
more advantageous. Assuming that the “squashed” sgn function has the shape 
as in Figure 1.2, notice that now the outputs take values in the range (—1,1) 
and are generally more discemible than in the previous case. Using units with 
continuous characteristics offers tremendous opportunities for new tasks that can 
be performed by neural networks. Specifically, the fine granularity of output pro- 
vides more information than the binary ±1 output of the thresholding element. 

An important new feature of networks with units having continuous charac- 
teristics is that they can be trained independent of their architecture. This has not 
been the case for many networks with units implementing a sgn function. The 
principles of this training are explained beginning in Section 2.4. Meanwhile, we 
concentrate on what trained networks could compute for us, rather than how to 



6 


ARTIFICIAL NEURAL SYSTEMS: PRELIMINARIES 



Figure 1.2 “Squashed” sgn function. 


design them. As an example, let us look at a neural network classifier applied to 
the solution of a real-world problem of biomedical signal processing. 

Neurologists often need to evaluate the electroencephalogram (EEG) sig- 
nals of a patient. The EEG waveforms contain important information about the 
patient’s abnormal brain function. The interpretation of EEG signals is used 
by neurologists when diagnosing the risk of seizures and deciding on corrective 
measures. The recordings are taken from the patient’s scalp and are multichannel. 
Up to 64 electrodes are used to monitor fully electrical impulses generated by 
the patient’s brain. 

Evaluation of the multichannel EEG data is rather difficult and therefore 
done most often by qualified neurologists. In addition, large amounts of data 
must be håndled because of the round-the-clock monitoring of each patient. To 
reduce the costs of monitoring and data interpretation, the EEG waves need to 
be processed by a computing device that provides on-line detection of signals of 
interest. One of the most important signal shapes that needs to be discriminated 
is an EEG spike which indicates an imminent epileptic seizure. 

EEG spike detection using a neural network classifier was recently developed 
(Eberhart and Dobbins 1990). Data are monitored in four channels of interest. 
The EEG waves are sampled 200 or 250 times per second within a 240-ms time 
window. This yields 48 or 60 data samples available from each channel to be 
evaluated. The data samples from each channel are fed to the neural network of 
interconnected units with squashed characteristics as in Figure 1.2. The total of 
41 units arranged in three layers does the processing of the data. Two output units 
are provided for the purpose of spike Identification. When the first output unit 
responds close to unity, this identifies a spike; when the second output yields 
high response, the network identifies the present input as nonspike. 

The network was designed by a team of signal processing engineers who 
developed the system and the design methodology, and by a team of four to 
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six neurologists who provided expert identification of spikes and nonspikes. The 
network was extensively trained using both spike and nonspike data. 

The test results for the trained network are impressive. Six series of com- 
prehensive tests were reported. First, the network was tested with spike/nonspike 
waves used earlier for training. All spikes used for training were positively 
identified by the network. Only 2 nonspikes of the total of 260 nonspikes were 
classified mistakenly as spikes. Following the test of the classification with the 
previously used training data, the classification of entirely new spike/nonspike 
wave data was performed. Again, all spikes were correctly identified. However, 
there were also some new nonspikes incorrectly classified by the network as 
spikes. Despite some number of false alarms, the performance of the network has 
been found to be significantly better than that required for practical application 
in hospitals. 

While classification of input data is of tremendous importance and will be 
revisited in more detail in subsequent chapters, let us tum our attention to other 
forms of mapping. An example of function approximation performed by a neural 
network is shown in Figure 1.3(a). The network’s input is a scalar variable x, and 
its output is a scalar function o(x). It can be seen from the figure that each of the 
n units operates with input that is equal to weighted input x plus individual bias 
that enters each of the summing nodes. Each of the n nonlinear mapping units 
produces the output. The outputs are then weighted and summed again. Then the 
nonlinear mapping is performed by the output unit and the network implements 
the function o(x). The function o(x) depends on a number of weights and biases 
as marked on the figure. 

Notice that the network configuration shown in Figure 1.3(a) can be used to 
approximate a function. Figure 1.3(b) illustrates a specific example of approxi- 
mation. The dashed line denotes the function that needs to be approximated and 
is known at P example points only. The exact function is 

h( x) = 0.8 sin ttx, - 1 < x < 1 (1 .2) 

Choosing P — 21, we obtain the data for function approximation as shown 
by the circles on the dashed curve. 

Approximation of function h(x) as in Equation (1.2) through function o(x) 
can be implemented using the network from Figure 1.3(a) for n = 10. Parameters 
of the neural network have been found as a result of the training using 21 points 
shown as training data. The reader may notice that the number of parameters for 
the discussed network is 31, with 20 of them being weights and 11 being bias 
levels. The mapping o(x) performed by the network is shown by a continuous 
line. Although the two functions h(x) and ø(;t) do not exactly overlap, the example 
shows the potential that exists for function approximation using neural networks. 

While neural networks for function approximation seem useful for the pur- 
pose of mathematical modeling, we may need a more convincing argument for 
how networks of this class may be used in practical applications. A sample 
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(Weight values are not marked X 

on the figure) Accurate function 


(Weight values are not marked X 

on the figure) Accurate function 

(a) NetWork function 

-*-*-*- Test points for approximation 
- O - O - O - Training points for network 
V = 0.4 S™ = 0.02 

(a) (b) 


Figure 1.3 Neural network with continuous units as function approximator: (a) block diagram 
and (b) example approximation of Equation (1 .2). 


network designed for the task of road following can offer us additional insight 
into the applied nature and potential of artificial neural systems. Consider, for 
example, ALVINN (Autonomous Land Vehicle In a Neural Network) project 
reported by Pomerleau (1989). The ALVINN network takes road images from a 
camera and a laser range finder as input. It produces as output the direction the 
vehicle should travel in order to follow the road. Let us look at some interesting 
details of this system. 

The architecture of ALVINN is shown in Figure 1.4. Inputs to the network 
consist of video and range information. The video information is provided by 
the 30 X 32 retina, which depicts the road scene. The resulting 960 segments 
of the scene are each coded into input proportional to the intensity of blue color. 
The blue band of the color image is used because it provides the highest contrast 
between the road and nonroad. The distance information is available at the input 
from a second retina consisting of 8 X 32 units and receiving the signal from a 
laser range finder. The resulting 256 inputs are each coded proportionally to the 
proximity of the corresponding area in the image. An additional single unit is 
made available in the network to indicate whether the road is lighter or darker 
than the nonroad in the previous image. 
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Road Intensity 45 Direction 

Feedback Unit Output Units 



30x32 Video 
Input Retina 


Figure 1.4 Architecture of autonomous vehicie driver. SOURCE: (Pomerleau 1989) ©Morgan- 
Kaufmann; reprinted with permission. 


The 1217 inputs are fed to the layer consisting of 29 units with squashed 
sgn characteristics. Each of the 29 units is connected to each of the 46 output 
units. A single output unit provides feedback input information about the road 
intensity, and the remaining 45 units are directional. The middle unit indicating 
high output gives the signal to drive straight ahead. Units to the left and right 
represent the curvature of the necessary tum that the vehicie should perform to 
stay at the road center. The extreme left and right units correspond to the sharpest 
left and right tums, respectively. 

ALVINN has been developed using computer-generated road images. The 
images have been used as training exemplars. The training set of different 
road images has involved 1200 different combinations of scenes, curvatures, 
lighting conditions and distortion levels. The actual neural network driver has 
been implemented using an on-board computer and a modified Chevy van. Per- 
formance of the network as reported by Pomerleau (1989) has been comparable 
to that achieved by the best traditional vision-based navigation systems evaluated 
under similar conditions. 

It seems that the salient image features important for accurate driving have 
been acquired by ALVINN not from the programmer or the algorithm, but 
from the data environment presented to the network. As a result, the system 
mastered the art of driving within a half-hour training session. It would take 
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many months of algorithm development and parameter tuning related to vision 
and pattem recognition work to develop a conventional autonomous driver of 
comparable efficiency. 

Interesting observations were reported following the development of the 
project. The network performed much better after it had been trained to re- 
cover after making mistakes during driving. This training apparently provided 
the network with instructions on how to develop corrective driving measures. 
Presently, the project is airried toward dealing more sensibly with road forks and 
intersections. 


Simple Memory and 
Restoration of Patterns 

This part of our neural network discussion introduces networks that re- 
spond, in time, to a presented pattem. Since they do so in a very characteristic 
way through a gradual reconstruction of a stored pattem, we call such networks 
simply memories. Only introductory discussion is presented below, with more in- 
depth coverage of memory analysis and design techniques to follow in Chapters 5 
and 6. 

Let us begin with an analysis of a very simple network as shown in Figure 
1.5(a). The network consists of three units computing values of sgn('), three 
summing nodes, and six weights of values +1 or —1. Signals passing through 
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Figure 1.5 Simple neural network memory: (a) network diagram and (b) listing of updates. 
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weights are multiplied by the weight values. Assume that the network is initial- 
ized at its output and the initial output is o x = 1, o 2 = 1, o 3 = 1. When the 
network is allowed to compute, we notice that inputs to units 1 and 2 are exactly 
zero, while the input to unit 3 is equal to —2. As a result, o x and o 2 do not 
update their respective values since sgn (0) is an undetermined value. However, 
o 3 updates to -1 after it leaves the initial State of 1. Therefore, single updating 
activity across the network results in o l = o 2 = 1 and o 3 = — 1. 

The next computing cycle does not bring about any change since inputs to 
units 1, 2, and 3 are 2, 2, and -2, respectively, thus yielding o x — o 2 = 1 and 
o 3 = — 1 again. The listing of discussed updates is shown in Figure 1.5(b) as 
cases 1 and 2. Transitions from initial output o x = — 1, o 2 = 1, o 3 = —1, and 
o x — 1, o 2 = —1, o 3 = —1 are also displayed as cases 3 and 4, respectively. 
Cases 1,3, and 4 result in a single unit update and they all terminate in the same 
final output of o x — 1, o 2 = 1, and ø 3 = —1. In addition, analysis of case 2 
indicates that this output is, in faet, a terminal one, no matter how many update 
cycles are allowed to take place thereafter. 

Figure 1.6 provides a geometrical interpretation of updates performed by the 
simple memory network from Figure 1.5. It can be seen that P 6 ( 1, 1,-1) is the 
stable output of the discussed memory. When a single entry of the initializing 
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Figure 1.7 Memory network applications: (a) images of three mechanical parts and (b) images 
converted to bit-map forms. 


binary output vector does not agree with P 6 , the network responds with the 
corrected entry. This output is sustained thereafter. 

The memory concept demonstrated in this example can be easily extended 
to real- world applications. Assume that three images as shown in Figure 1.7(a) 
need to be stored in a memory network. They are a nut, a bolt, and a wrench as 
display ed on the 20 X 16 grid. The network, which can store any bit maps placed 
on the grid shown, has 320 units, each computing the sgn function. Essentially, 
the network is an extension of memory from Figure 1.5(a). The first preparatory 
step that needs to be taken is to represent the actual images through grid elements 
that are entirely white or black. In this way we obtain a set of three approximated 



images. Examples of such images obtained from Figure 1.7(a) are illustrated in 
Figure 1.7(b). 

A neural network memory can now be designed using the methods explained 
in Chapter 6. The memory holds each of the three binary pattems corresponding 
to bit maps from Figure 1.7 (b). Once the memory is initialized at its output, it 
starts computing. After the evolution through intermediate States, the memory 
should terminate at one of its originally stored outputs. Figure 1.8 illustrates 
a sequence of outputs when a distorted image of the wrench is used to ini- 
tialize the computation. The sequence consists of three snapshots taken during 
the computation performed by the memory network. It can be seen that the 
correct wrench shape is restored after a number of correctional transitions. This 
application indicates the potential of neural networks memory to restore heavily 
distorted images. 
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Optimizing Networks 

Optimization is one of the most important objectives of engineering. The 
purpose of optimization is to minimize certain cost functions, usually defined 
by the user. Neural computation methods are successfully applied, among others, 
to the solution of some optimization problems. A more formal discussion of 
neural computation for optimization is covered in Chapter 5. Below we discuss 
only introductory concepts. 

Assume that an analog value x, 0 ^ x ^ 3.5, needs to be converted to a 
two-bit binary number v x v 0 such that 

x = 2v { + v 0 (1.3) 

where v 1? v 0 = 0, 1. Obviously, for the conditions stated in Equation (1.3) four 
possible solutions to the problem exist: 00, 01, 10, 11. A simple network similar 
to the one depicted in Figure 1.5(a) can be designed to solve the conversion 
problem stated in Equation (1.3). However, the network consists of only two 
units described by “squashed” characteristics, each responding in the range (0, 1). 
The block diagram of the network is shown in Figure 1.9. In addition to the two 
units, the network contains a number of interconnecting elements not shown on 
the figure. 



Neural network 


Figure 1.9 Block diagram of a two-bit A/D converter. 
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(a) (b) 

Figure 1.10 Energy function for a two-bit A/D conversion problem: (a) x = 0 and (b) x = 1. 


Note that the conversion expressed by Equation (1.3) corresponds to the 
minimization of analog-to-digital conversion error, where 

error = (x — 2v x — v 0 ) 2 (1.4) 

in the presence of such constraints that v x , v 0 can take values of 0 or 1. This 
constrained minimization problem can be fit into a certain class of neural net- 
works. The class has the unique property such that the so-called energy function 
in these networks undergoes minimization while the network is computing. 

The property of energy minimization is of great importance. It States that 
the network seeks out, by itself, its energy minimum and then setties there. A 
number of optimization problems can be translated directly into the minimiza- 
tion of a neural network ’s energy function. Once the translation is accomplished 
and the neural network designed accordingly, the optimization task is transferred 
to the neural network for actual solution. Minimization of energy by the net- 
work can be considered analogous to the minimization of error expressed by 
Equation (1.4). 

Figure 1.10 illustrates two cases of minimization of the energy function 
£(v 0 ,vi). Figure 1.1 0(a) shows that for input x = 0 the energy minimum is 
located at v 0 = v x = 0. Furthermore, the network output should always stabilize 
at that comer, since the energy function is monotonic. Similarly, convergence of 
the network output is guaranteed toward v 0 = 1 and v x = 0 for input x = 1 . 
This can be seen from Figure 1.10(b). In actual networks, output approaches a 
comer but never reaches it since 0 < v 0 < 1 and 0 < v x < 1. 
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The class of neural networks exemplified here using a simple two-bit analog- 
to-digital (A/D) converter is very useful for the solution of a number of prob- 
lems. In particular, it is useful for the solution of combinational optimization 
tasks. They are characterized by NI or e^ possible solutions of problems that 
are of size N. Under stand ably, evaluation of the cost function at each potential 
solution point of such problems can be very computationally intensive. Therefore, 
other ways of combinational optimization are studied, and neural networks are 
among them. 

One of the problems that can be successfully solved by this class of net- 
works is called job- shop scheduling. Job-shop scheduling is a resource allocation 
problem, subject to allocation and sequencing constraints. The resources are typ- 
ically machines and the jobs are the basic tasks that need to be accomplished 
using the machines. Each task may consist of several subtasks related by certain 
precedence restrictions. The job-shop problem is solved when the starting times 
of all jobs are determined and the sequencing rules are not violated. The cost 
function to be minimized is defined for this problem as a sum of the starting 
times of all jobs subject to compliance with precedence constraints (Foo and 
Takefuji 1988). Problems that are somewhat simpler but belong to the same 
group of optimization tasks include scheduling classrooms to classes, hospital 
patients to beds, workers to tasks they can accomplish best, etc. 


Clustering and Feature Detecting Networks 

A rather important group of neural networks can be used for detecting 
clusters of data. Such networks are tuned to certain similarity aspects that are of 
interest in the data being evaluated. For example, we may be interested in group- 
ing certain measurement results together to suppress any systematic errors that 
may have occurred during measurements. We may also be interested in detecting 
regularly appearing components of inputs. These components may indicate the 
true signal components as opposed to the noise that is random and would not 
form any clusters. 

Clustering and feature detecting networks exhibit remarkable properties of 
self-organization. These properties are closely related to the formation of knowl- 
edge representation in artificial intelligence and information theory. The networks 
of this class usually have a simple architecture; considerable subtleties arise 
mainly during the self-organization process, which is discussed in more detail in 
Chapter 7. 

Another important function of neural networks can be feature detection. Fea- 
ture detection is usually related to the dimensionality reduction of data. Some 
networks provide impressive planar mapping of features that are a result of multi- 
dimensional inputs of a fairly complex structure. Some of the many applications 
reported are discussed in Chapters 7 and 8 and one is especially intriguing. It 
involves mapping speech features as described below. 
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Figure 1.11 Phonotopic map of the Finnish word humpilla as an example of extracted features 
of speech. SOURCE: (Kohonen, 1988) © IEEE; reprinted with permission. 


Speech processing studies indicate that speech signal consists of 15 fre- 
quency channels within the audio-frequency band. Elementary components of 
speech, called phonemes, could thus be display ed in the 15 -dimensional spectral 
space of speech signals. The problem with such representation of phonemes is 
that it is of no use for us because our powers of visualization are limited to 
three dimensions. Through appropriate neural network mapping, however, it is 
possible to “flatten” the 15 -dimensional Spectrum of speech onto a constrained 
planar array. Figure 1.11 shows a so-called “phonotopic” map of a spoken word. 
The word humpilla has been used to produce the map. The rectangle is filled 
with units (not shown) that are selectively activated during speech. The sequence 
of word phonemes forms a trajectory that provides an impressive visualization of 
the continuous speech signal. Phonotopic maps may be invaluable for building 
phonetic typewriters, for speech training, and for therapy. Profoundly deaf people 
could have visual feedback from their speech. 


1.2 ir .i. —i 

HISTORY OF 
ARTIFICIAL NEURAL 
SYSTEMS DEVELOPMENT 

Artificial neural systems development has an interesting history. Since it is 
not possible to cover this history in depth in a short introductory chapter, only 
major achievements are mentioned. This glimpse at the field’s past milestones 
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should provide the reader with an appreciation of how contributions to the field 
have led to its development over the years. The historical summary below is not 
exhaustive; some milestones are omitted and some are mentioned only briefly. 

The year 1943 is often considered the initial year in the development of 
artificial neural systems. McCulloch and Pitts (1943) outlined the first formal 
model of an elementary computing neuron. The model included all necessary 
elements to perform logic operations, and thus it could function as an arithmetic- 
logic computing element. The implementation of its compact electronic model, 
however, was not technologically feasible during the era of bulky vacuum tubes. 
The formal neuron model was not widely adopted for the vacuum tube comput- 
ing hardware description, and the model never became technically significant. 
However, the McCulloch and Pitts neuron model laid the groundwork for future 
developments. 

Influential researchers of that time suggested that research in design of 
brain-like processing might be interesting. To quote John von Neumann’s (1958) 
observations on the “brain language’’: 

We have now accumulated sufficient evidence to see that whatever 
language the central nervous system is using, it is characterized by less 
logical and arithmetical depth than what we are normally used to. The 
following is an obvious example of this: the retina of the human eye 
performs a considerable reorganization of the visual image as perceived 
by the eye. Now this reorganization is effected on the retina, or to be 
more precise, at the point of entry of the optic nerve by means of 
the successive synapses only, i.e. in terms of three consecutive logical 
steps. The statistical character of the message system used in the arith- 
metics of the central nervous system and its low precision, described 
earlier, cannot proceed very far in the message systems involved. Con- 
sequently, there exist here different logical structures from the ones we 
are ordinarily used to in logics and mathematics. They are character- 
ized by less logical and arithmetical depth than we are used to under 
otherwise similar circumstances. Thus logic and mathematics in the 
central nervous system, when viewed as languages, must structurally 
be essentially different from those languages to which our common 
experience refers. 

Donald Hebb (1949) first proposed a leaming scheme for updating neuron’s 
connections that we now refer to as the Hebbian leaming rule. He stated that the 
information can be stored in connections, and postulated the leaming technique 
that had a profound impact on future developments in this field. Hebb’s leaming 
rule made primary contributions to neural networks theory. 

During the 1950s, the first neurocomputers were built and tested (Minsky 
1954). They adapted connections automatically. During this stage, the neuron-like 
element called a perceptron was invented by Frank Rosenblatt in 1958. It was a 
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trainable machine capable of leaming to classify certain pattems by modifying 
connections to the threshold elements (Rosenblatt 1958). The idea caught the 
imagination of engineers and scientists and laid the groundwork for the basic 
machine leaming algorithms that we still use today. 

In the early 1960s a device called AD ALINE (for ADAptive LINEar com- 
biner) was introduced, and a new, powerful leaming rule called the Widrow-Hoff 
leaming rule was developed by Bernard Widrow and Marcian Hoff (1960, 1962). 
The rule minimized the summed square error during training involving pattem 
classification. Early applications of ADALINE and its extensions to MADALINE 
(for Many ADALINES) include pattem recognition, weather forecasting, and 
adaptive Controls. The monograph on leaming machines by Nils Nilsson (1965) 
clearly summarized many of the developments of that time. That book also for- 
mulates inherent limitations of leaming machines with modifiable connections. 

t)espite the successes and enthusiasm of the early and mid-1960s, the ex- 
isting machine leaming theorems of that time were too weak to support more 
complex computational problems. Although the bottlenecks were exactly iden- 
tified in Nilsson’s work and the neural network architectures called layered 
networks were also known, no efficient leaming schemes existed at that time 
that would circumvent the formidable obstacles (see Section 4.1). Neural net- 
work research entered into the stagnation phase. Another reason that contributed 
to this research slowdown at that time was the relatively modest computational 
resources available then. 

The final episode of this era was the publication of a book by Marvin Min- 
sky and Seymour Papert (1969) that gave more doubt as to the layered leaming 
networks’ potential. The stated limitations of the perceptron-class networks were 
made public; however, the challenge was not answered until the mid-1980s. The 
discovery of successful extensions of neural network knowledge had to wait 
until 1986. Meanwhile, the mainstream of research flowed toward other areas, 
and research activity in the neural network field, called at that time cybernetics, 
had sharply decreased. The artificial intelligence area emerged as a dominant and 
promising research field, which took over, among others, many of the tasks that 
neural networks of that day could not solve. 

During the period from 1965 to 1984, further pioneering work was accom- 
plished by a handful of researchers. The study of leaming in networks of thresh- 
old elements and of the mathematical theory of neural networks was pursued by 
Sun-Ichi Amari (1972,1977). Also in Japan, Kunihiko Fukushima developed a 
class of neural network architectures known as neocognitrons (Fukushima and 
Miyaka 1980). The neocognitron is a model for visual pattem recognition and is 
concemed with biological plausibility. The network emulates the retinal images 
and processes them using two-dimensional layersof neurons. 

Associative memory research has been 'pursued by, among others, Tuevo 
Kohonen in Finland (1977, 1982, 1984, 1988) and James A. Anderson (Anderson 
et al. 1977). Unsupervised leaming networks were developed for feature mapping 
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into regular array s of neurons (Kohonen 1982). Figure 1.11 demonstrates one of 
the practical uses of this class of networks. Stephen Grossberg and Gail Carpenter 
have introduced a number of neural architectures and theories and developed the 
theory of adaptive resonance networks (Grossberg 1974, 1982). 

During the period from 1982 until 1986, several seminal publications were 
published that significantly furthered the potential of neural networks. The era of 
renaissance started with John Hopfield (1982, 1984) introducing a recurrent neural 
network architecture for associative memories. His papers formulated computa- 
tional properties of a fully connected network of units. Figures 1.5 through 1.10 
illustrate basic concepts, examples and applications of fully connected networks 
originally introduced by Hopfield. 

Another revitalization of the field came from the publication in 1986 of 
two volumes on parallel distributed processing, edited by James McClelland and 
David Rumelhart (1986). The new leaming rules and other concepts introduced 
in this work have removed one of the most essential network training barriers 
that grounded the mainstream efforts of the mid-1960s. The publication by Mc- 
Clelland and Rumelhart opened a new era for the once-underestimated computing 
potential of layered networks. The function approximator, EEG spike detector and 
autonomous driver discussed in the previous section provide examples facilitated 
by the new leaming mies. 

Although the mathematical framework for the new training scheme of lay- 
ered networks was discovered in 1974 by Paul Werbos, it went largely unnoticed 
at that time (Werbos 1974). According to the most recent statement (Dreyfus 
1990), the first authors of the optimization approach for multilayer feedforward 
systems were Bryson (Bryson and Ho 1969) and Kelley (Kelley 1969) who 
obtained a gradient solution for multistage network training. In 1962, Dreyfus 
used a simple, new recursive derivation based on the chain-rule of differentiation 
to prove the Bryson-Kelley results and dealt explicitly with the optimal control 
problem in its discrete-stage form (Dreyfus 1962). Their work, however, has not 
been carried to maturity and adopted for neural network leaming algorithms. 
Most of the developments covered in this historical overview are looked at in 
detail in subsequent chapters. 

Beginning in 1986-87, many new neural networks research programs were 
initiated. The intensity of research in the neurocomputing discipline can be mea- 
sured by a quickly growing number of conferences and journals devoted to the 
field. In addition to many edited volumes that contain collections of papers, 
several books have already appeared. The list of applications that can be solved 
by neural networks has expanded from small test-size examples to large practical 
tasks. Very-large-scale integrated neural network chips have been fabricated. At 
the time of this writing, educational offerings have been established to explore the 
artificial neural systems science. Although neurocomputing has had an interesting 
history, the field is still in its early stages of development. 
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1.3 

FUTURE OUTLOOK 

The current explosion of interest in artificial neural systems is based on a 
number of scientific and economic expectations. One should be aware, however, 
that some of the popular expectations may be exaggerated and are not real i Stic. 
The tendency for overexpectations comes to the field of artificial neural systems 
naturally through its name and, sometimes, through the terminology it uses. It is 
somewhat tempting for a human being to envisage a human-like machine with 
human or near-human perceptual, reasoning, and movement capabilities. Anthro- 
pomorphic terminology, inspirations, and the sensory processing applications of 
neural network technology may be blamed partially for linking such human-like 
man-made machines with what is merely a technical discipline. Such “links” 
should be dismissed with good humor. 

We can be quite sure that neural networks will not replace conventional 
computers. Basic reasons preclude neurocomputers replacing conventional ones 
for most tasks. Conventional computers are now very inexpensive to make. They 
are extremely fast and accurate for executing mathematical subroutines, text pro- 
cessing, computer-aided design, data processing and transfer, and many other 
tasks. In general, conventional digital computers outperform any other computing 
devices in numerical calculations. 

One possibility for the use of artificial neural systems is to simulate physical 
systems that are best expressed by massively parallel networks. Also, low-level 
signal processing might best be done by parallel analog or analog and digital 
neural networks. Perhaps the most likely applications for neural networks will be 
those involving classification, association, and reasoning rather than sequential 
arithmetic computing. Not surprisingly, the best hope for the widespread use of 
artificial neural systems, or neurocomputing, is in computationally intensive areas 
that are not successfully attacked by conventional computers. It seems that the 
areas requiring human-like inference and perception of speech and vision are 
the most likely candidates for applications. If these succeed, there will be more 
applications in real-time control of complex systems and other applications that 
we cannot yet anticipate. 

Neural networks are also expected to be widely applied in expert systems 
and in a variety of signal processors. At present, such systems are available as 
aids for medical diagnosis, financial services, stock price prediction, solar flare 
forecasting, radar pulse identification, and other applications. As most researchers 
agree, future artificial neural systems are not going to replace computational and 
artificial intelligence simulations on conventional computers either. Rather, they 
will offer a complementary technology. The ultimate goal may be to exploit both 
technologies under the same roof, while presenting a single, flexible interface to 
the user. 
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Today’s neurocomputers are often conventional programmable computers 
running neural simulation software. Such simulated neurocomputers provide suit- 
able but conventional input and output interfaces. Some neurocomputers employ 
dedicated architectures, or neurocomputing boards, that make the neural net- 
work simulations operate faster and more accurately when compared to standard 
architectures. Understandably, as neurocomputers become faster, smaller, and 
more efficient, the limits of applications will expand beyond simulated solutions. 
Futhermore, it seems certain that the proliferation of application-specific very- 
large-scale integrated (VLSI) neural networks is of crucial importance for the 
long-term success of the technology. 

Fortunately, artificial neural systems fabrication follows the invented com- 
puting paradigms. Optical neural network implementations are expected to be- 
come directly applicable. Since 1986, when AT&T’s first neural integrated Circuit 
memory was fabricated, we have seen the proliferation of microelectronic neural 
network chips. VLSI neural systems with modifiable weights have reached the 
market. However, general-purpose trainable neural network chips with on-chip 
leaming are probably several years off. 

In the long term, we could expect that artificial neural systems will be used 
in applications involving vision, speech, decision-making, and reasoning, but 
also as signal processors such as filters, detectors, and quality control systems. 
Applications are expected especially for processing large amounts of data. Also, 
neural networks may offer solutions for cases in which a processing algorithm 
or analytical solutions are hard to find, hidden, or nonexistent. Such cases in- 
clude modeling complex processes, extracting properties of large sets of data, 
and providing identification of plants that need to be controlled. A robot system 
already exists that combines vision with robot arm motion control and leams 
hand-eye coordination. Such systems can be expected to be extended over the 
next few years to produce a sophisticated robotic system. Continuous, speaker- 
independent speech recognition systems are being developed. They are likely to 
be several years off, but the limited vocabulary systems may be commercially 
available sooner. 

Artificial neural networks technology is still very new and is developing 
quickly. We are witnessing fast expansion of neural network-based intelligent 
machines. Hopefully, this expansion will enhance the quality of our lives and 
make many difficult tasks easier to accomplish. Further speculations may not 
be needed, because the technological achievements and available products will 
speak for themselves. 
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FUNDAMENTAL CONCEPTS 
AND MODELS OF ARTIFICIAL 
NEURAL SYSTEMS 

Nothing happens in the universe 
that does not have a sense of either 
certain maximum or minimum. 

L. EULER 

2.1 Biological Neurons and Their Artificial Models 

2.2 Models of Artificial Neural Networks 

2.3 Neural Processing 

2.4 Learning and Adaptation 

2.5 Neural NetWork Learning Rules 

2.6 OverView of Neural Networks 

2.7 Concluding Remarks 

T here are a number of different answers possible to the question of how to 
define neural networks. At one extreme, the answer could be that neural 
networks are simply a class of mathematical algorithms, since a network can be 
regarded essentially as a graphic notation for a large class of algorithms. Such 
algorithms produce solutions to a number of specific problems. At the other end, 
the reply may be that these are synthetic networks that emulate the biological 
neural networks found in living organisms. In light of today’s limited knowledge 
of biological neural networks and organisms, the more plausible answer seems 
to be doser to the algorithmic one. 

In search of better solutions for engineering and computing tasks, many 
avenues have been pursued. There has been a long history of interest in the 
biological Sciences on the part of engineers, mathematicians, and physicists en- 
deavoring to gain new ideas, inspirations, and designs. Artificial neural networks 
have undoubtedly been biologically inspired, but the close correspondence be- 
tween them and real neural systems is still rather weak. Vast discrepancies exist 
between both the architectures and capabilities of artificial and natural neural 
networks. Knowledge about actual brain functions is so limited, however, that 
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there is little to guide those who would try to emulate them. No models have 
been successful in duplicating the performance of the human brain. Therefore, 
the brain has been and still is only a metaphor for a wide variety of neural 
network configurations that have been developed (Durbin 1989). 

Despite the loose analogy between artificial and natural neural systems, we 
will briefly review the biological neuron model. The synthetic neuron model will 
subsequently be defined in this chapter and examples of network classes will 
be discussed. The basic definitions of neuron and elementary neural networks 
will also be given. Since no common standards are yet used in the technical 
literature, this part of the chapter will introduce notation, graphic symbols, and 
terminology used in this text. The basic forms of neural network processing will 
also be discussed. 

The reader may find this initially surprising, but the majority of this book’s 
content is devoted to network leaming and, specifically, to experiential training. 
In this chapter basic leaming concepts are characterized and defined. While the 
exposition of artificial neural network leaming is developed gradually here and 
throughout subsequent chapters, the reader should find it intriguing and appealing 
to discover how powerful and useful leaming techniques can be. Since neural net- 
works represent a rich class of leaming algorithms and architectures, a taxonomy 
of the most important networks is presented. 


2.1 _____ 

BIOLOGICAL NEURONS AND 
THEIR ARTIFICIAL MODELS 

A human brain consists of approximately 10 11 computing elements called 
neurons. They communicate through a connection network of axons and synapses 
having a density of approximately 10 4 synapses per neuron. Our hypothesis re- 
garding the modeling of the natural nervous system is that neurons communicate 
with each other by means of electrical impulses (Arbib 1987). The neurons 
operate in a Chemical environment that is even more important in terms of ac- 
tual brain behavior. We thus can consider the brain to be a densely connected 
electrical switching network conditioned largely by the biochemical processes. 
The vast neural network has an elaborate structure with very complex intercon- 
nections. The input to the network is provided by sensory receptors. Receptors 
deliver stimuli both from within the body, as well as from sense organs when the 
stimuli originate in the extemal world. The stimuli are in the form of electrical 
impulses that convey the information into the network of neurons. As a result of 
information processing in the central nervous systems, the effectors are controlled 
and give human responses in the form of diverse actions. We thus have a three- 
stage system, consisting of receptors, neural network, and effectors, in control of 
the organism and its actions. 
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A lucid, although rather approximate idea, about the information links in 
the nervous system is shown in Figure 2.1. As we can see from the figure, the 
information is processed, evaluated, and compared with the stored information in 
the central nervous system. When necessary, commands are generated there and 
transmitted to the motor organs. Notice that motor organs are monitored in the 
central nervous system by feedback links that verify their action. Both intemal 
and extemal feedback control the implementation of commands. As can be seen, 
the overall nervous system structure has many of the characteristics of a closed- 
loop control system. 


Biological Neuron 

The elementary nerve cell, called a neuron , is the fundamental budding block 
of the biological neural network. Its schematic diagram is shown in Figure 2.2. A 
typical cell has three major regions: the cell body, which is also called the soma , 
the axon, and the dendrites. Dendrites form a dendritic tree, which is a very fine 
bush of thin fibers around the neuron’s body. Dendrites receive information from 
neurons through axons — long fibers that serve as transmission lines. An axon is 
a long cylindrical connection that carries impulses from the neuron. The end part 
of an axon splits into a fine arborization. Each branch of it terminates in a small 
endbulb almost touching the dendrites of neighboring neurons. The axon-dendrite 
contact organ is called a synapse. The synapse is where the neuron introduces its 
signal to the neighboring neuron. The signals reaching a synapse and received 
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Nerve Impulse 
Train 


Figure 2.2 Schematic diagram of a neuron and a sample of pulse train. (Adapted from (Day- 
hoff 1990), © Van Nostrand Reinhoid; with permission.) 


by dendrites are electrical impulses. The intemeuronal transmission is sometimes 
electrical but is usually effected by the release of Chemical transmitters at the 
synapse. Thus, terminal boutons generate the Chemical that affects the receiving 
neuron. The receiving neuron either generates an impulse to its axon, or produces 
no response. 

The neuron is able to respond to the total of its inputs aggregated within a 
short time interval called the period of latent summation. The neuron’s response is 
generated if the total potential of its membrane reaches a certain level. The mem- 
brane can be considered as a shell, which aggregates the magnitude of the incom- 
ing signals over some duration. Specifically, the neuron generates a pulse response 
and sends it to its axon only if the conditions necessary for firing are fulfilled. 
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Let us consider the conditions necessary for the firing of a neuron. Incom- 
ing impulses can be excitatory if they eause the firing, or inhibitory if they 
hinder the firing of the response. A more precise condition for firing is that the 
excitation should exceed the inhibition by the amount called the threshold of 
the neuron, typically a value of about 40 mV (Arbib 1987). Since a synaptic 
connection causes the excitatory or inhibitory reactions of the receiving neuron, 
it is practical to assign positive and negative unity weight values, respectively, 
to such connections. This allows us to reformulate the neuron’s firing condition. 
The neuron fires when the total of the weights to receive impulses exceeds the 
threshold value during the latent summation period. 

The incoming impulses to a neuron can only be generated by neighboring 
neurons and by the neuron itself. Usually, a certain number of incoming impulses 
are required to make a target cell fire. Impulses that are closely spaced in time and 
arrive synchronously are more likely to cause the neuron to fire. As mentioned 
before, observations have been made that biological networks perform temporal 
integration and summation of incoming signals. The resulting spatio-temporal 
processing performed by natural neural networks is a complex process and much 
less structured than digital computation. The neural impulses are not synchro- 
nized in time as opposed to the synchronous discipline of digital computation. 
The characteristic feature of the biological neuron is that the signals generated do 
not differ significantly in magnitude; the signal in the nerve fiber is either absent 
or has the maximum value. In other words, information is transmitted between 
the nerve cells by means of binary signals. 

After carrying a pulse, an axon fiber is in a State of complete nonexcitability 
for a certain time called the refractory period. For this time interval the nerve 
does not conduct any signals, regardless of the intensity of excitation. Thus, we 
may divide the time scale into consecutive intervals, each equal to the length of 
the refractory period. This will enable a discrete-time description of the neurons’ 
performance in terms of their States at discrete time instances. For example, we 
can specify which neurons will fire at the instant k+ 1 based on the excitation 
conditions at the instant k. The neuron will be excited at the present instant if the 
number of excited excitatory synapses exceeds the number of excited inhibitory 
synapses at the previous instant by at least the number T , where T is the neuron ’s 
threshold value. 

The time units for modeling biological neurons can be taken to be of the 
order of a millisecond. However, the refractory period is not uniform over the 
cells. Also, there are different types of neurons and different ways in which they 
connect. Thus, the picture of real phenomena in the biological neural network 
becomes even more involved. We are dealing with a dense network of inter- 
connected neurons that release asynchronous signals. The signals are then fed 
forward to other neurons within the spatial neighborhood but also fed back to 
the generating neurons. 


30 


FUNDAMENTAL CONCEPTS AND MODELS OF ARTIFICIAL NEURAL SYSTEMS 


The above discussion is extremely simplified when seen from a neurobio- 
logical point of view, though it is valuable for gaining insight into the principles 
of “biological computation.” Our computing networks are far simpler than their 
biological counterparts. Let us examine an artificial neuron model that is of 
special, historical significance. 


McCulloch-Pitts Neuron Model 

The first formal definition of a synthetic neuron model based on the highly 
simplified considerations of the biological model described in the preceding sec- 
tion was formulated by McCulloch and Pitts (1943). The McCulloch-Pitts model 
of the neuron is shown in Figure 2.3a. The inputs x h for i = 1, 2, are 0 

or 1, depending on the absence or presence of the input impulse at instant k . The 
neuron’s output signal is denoted as o. The firing rule for this model is defined 


as follows 



1 if i WfX* > T 
i= 1 

0 if I WjX? < T 

i=i 

i 





Figure 2.3 McCulloch-Pitts model neuron and elementary logic networks: (a) model diagram, 
(b) NOR gate, (c) NAND gate, and (d) memory cell. 
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where superscript k = 0, 1,2,... denotes the discrete-time instant, and is the 
multiplicative weight connecting the fth input with the neuron’s membrane. In 
further discussion, we will assume that a unity delay elapses between the instants 
k and k + 1. Note that w ( = +1 for excitatory synapses, w t = —1 for inhibitory 
synapses for this model, and T is the neuron’s threshold value, which needs to 
be exceeded by the weighted sum of signals for the neuron to fire. 

Although this neuron model is very simplistic, it has substantial computing 
potential. It can perform the basic logic operations NOT, OR, and AND, pro- 
vided its weights and thresholds are appropriately selected. As we know, any 
multivariable combinational function can be implemented using either the NOT 
and OR, or altematively the NOT and AND, Boolean operations. Examples of 
three-input NOR and NAND gates using the McCulloch-Pitts neuron model are 
shown in Figure 2.3(b) and (c). The reader can easily inspect the implemented 
functions by compiling a truth table for each of the logic gates shown in the 
figure. 

Both the neuron model and the example logic circuits discussed so far 
have been combinational and little attention has been paid to the inherent delay 
involved in their operation. However, the unity delay property of the McCulloch- 
Pitts neuron model makes it possible to build sequential digital circuitry. 

First note that a single neuron with a single input x and with the weight and 
threshold values both of unity, computes o k+l = x k . Such a simple network thus 
behaves as a single register cell able to retain the input for one period elapsing 
between two instants. As a consequence, once a feedback loop is closed around 
the neuron as shown in Figure 2.3(d), we obtain a memory cell. An excitatory 
input of 1 initializes the firing in this memory cell, and an inhibitory input of 1 
initializes a nonfiring State. The output value, at the absence of inputs, is then 
sustained indefinitely. This is because the output of 0 fed back to the input does 
not cause firing at the next instant, while the output of 1 does. 

Thus, we see that digital computer hardware of arbitrary complexity can be 
constructed using an artificial neural network consisting of elementary budding 
blocks as shown in Figure 2.3. Our purpose, however, is not to duplicate the 
function of already efficient digital circuitry, but rather to assess and exploit the 
computational power that is manifested by interconnected neurons subject to the 
experiential leaming process. 


Neuron Modeling for Artificial 
Neural Systems 

The McCulloch-Pitts model of a neuron is characterized by its formalism 
and its elegant, precise mathematical definition. However, the model makes use 
of several drastic simplifications. It allows binary 0, 1 States only, operates under 
a discrete-time assumption, and assumes synchrony of operation of all neurons 
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Multiplicative 

weights 


Figure 2.4 General symbol of neuron consisting of processing node and synaptic connections. 

in a larger network. Weights and the neurons’ thresholds are fixed in the model 
and no interaction among network neurons takes place except for signal flow. 
Thus, we will consider this model as a starting point for our neuron modeling 
discussion. Specifically, the artificial neural systems and computing algorithms 
employ a variety of neuron models that have more diversified features than the 
model just presented. Below, we introduce the main artificial neuron models that 
will be used later in this text. 

Every neuron model consists of a processing element with synaptic input 
connections and a single output. The signal flow of neuron inputs, x h is consid- 
ered to be unidirectional as indicated by arrows, as is a neuron’s output signal 
flow. A general neuron symbol is shown in Figure 2.4. This symbolic represen- 
tation shows a set of weights and the neuron’s processing unit , or node. The 
neuron output signal is given by the following relationship: 

o =/( w r x), or (2.1a) 

0=/(j>^) (2.1b) 

where w is the weight vector defined as 

W = [w x w 2 ••• w n \ 

and x is the input vector: 



(All vectors defined in this text are column vectors; superscript t denotes a 
transposition.) The function /( w r x) is often referred to as an activation function. 
Its domain is the set of activation values, net , of the neuron model, we thus often 
use this function as f{net). The variable net is defined as a scalar product of the 
weight and input vector 

A t 

net = w x 


( 2 . 2 ) 
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The argument of the activation funetion, the variable net , is an analog of the bio- 
logical neuron’s membrane potential. Note that temporarily the threshold value is 
not explicitly used in (2.1) and (2.2), but this is only for notational convenience. 
We have momentarily assumed that the modeled neuron has n — 1 actual synaptic 
connections that come from actual variable inputs x x , x 2 , . . . , x n -i- We have also 
assumed that x n — — 1 and w n = T. Since threshold plays an important role for 
some models, we will sometimes need to extract explicitly the threshold as a 
separate neuron model parameter. 

The general neuron symbol, shown in Figure 2.4 and described with expres- 
sions (2.1) and (2.2), is commonly used in neural network literature. However, 
different artificial neural network classes make use of different definitions of 
f(net). Also, even within the same class of networks, the neurons are sometimes 
considered to perform differently during different phases of network operation. 
Therefore, it is pedagogically sound to replace, whenever needed, the general 
neuron model symbol from Figure 2.4 with a specific f(net) and a specific neu- 
ron model. The model validity will then usually be restricted to a particular class 
of network. Two main models introduced below are often used in this text. 

Acknowledging the simplifications that are necessary to model a biolog- 
ical neuron network with artificial neural networks, we introduce in this text 
the following terminology: (1) neural networks are meant to be artificial neural 
networks consisting of neuron models and (2) neurons are meant to be artificial 
neuron models as defined in this chapter. 

Observe from (2.1) that the neuron as a processing node performs the oper- 
ation of summation of its weighted inputs, or the scalar product computation to 
obtain net. Subsequently, it performs the nonlinear operation f(net) through its 
activation funetion. Typical activation funetions used are 


f(net) = 


2 

1 + exp(— A net) 


- 1 


(2.3a) 


and 


f(net) = sgn (net) 


+ 1 , 

- 1 . 


net > 0 
net < 0 


(2.3b) 


where A > 0 in (2.3a) is proportional to the neuron gain determining the steep- 
ness of the continuous funetion f(net ) near net = 0. The continuous activation 
funetion is shown in Figure 2.5(a) for various A. Notice that as A — ► <», the 
limit of the continuous funetion becomes the sgn (net) funetion defined in (2.3b). 
Activation funetions (2.3 a) and (2.3b) are called bipolar continuous and bipolar 
binary funetions, respectively. The word “bipolar” is used to point out that both 
positive and negative responses of neurons are produced for this definition of the 
activation funetion. 

By shifting and scaling the bipolar activation funetions defined by (2.3), 
unipolar continuous and unipolar binary activation funetions can be obtained, 
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respectively, as 


and 


f(net) = 


1 

1 + exp(— knet) 



net > O 
net < O 


(2.4a) 

(2.4b) 


Function (2.4a) is shown in Figure 2.5(b). Again, the unipolår binary function 
is the limit of f(net) in (2.4a) when A —> oo. The soft-limiting activation func- 
tions (2.3a) and (2.4a) are often called sigmoidal characteristics , as opposed to 
the hard-limiting activation functions given in (2.3b) and (2.4b). Hard-limiting 
activation functions describe the discrete neuron model. 


Most neurons in this text employ bipolar activation functions. Some neu- 
ral network architectures or applications do, however, specifically require the 
unipolår neuron responses. If this is the case, appropriate qualification for the 
type of activation function used is made. Essentially, any function f(net) that is 
monotonically increasing and continuous such that net e R and f (net) e (—1,1) 
can be used instead of (2.3a) in neural modeling. A few neural models that often 
involve some form of feedback require the use of another type of nonlinearity 
than that defined in (2.3) and (2.4). An example activation function that is a 
unipolår ramp is shown in Figure 2.6. 

If the neuron’s activation function has the bipolar binary form of (2.3b), the 
symbol of Figure 2.4 can be replaced by the diagram shown in Figure 2.7(a), 
which is actually a discrete neuron functional block diagram showing summation 
performed by the summing node and the hard-limiting thresholding performed 
by the threshold logic unit (TLU). This model consists of the synaptic weights, 
a summing node, and the TLU element. 
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Figure 2.6 Unipolar ramp activation function. 


In the case of the continuous activation function as in (2.3a), the model 
used is shown in Figure 2.7(b). The neuron is depicted as a summing high- 
gain saturating amplifier which amplifies its input signal w f x. The models in 
Figure 2.7(a) and (b) can be called discrete (binary) and continuous perceptrons , 
respectively. The discrete perceptron, introduced by Rosenblatt (1958), was the 
first leaming machine. It can be viewed as a precursor of many of the neural 
network models that we use today. Also, its study provides considerable insight 
into the nature of artificial neural systems. 

As has been shown, neuron outputs are either discrete (binary) or continuous. 
Given a layer of m neurons, their output values o l9 o 2 , . . . , o m can be arranged 
in a layer ’s output vector: 

o=[°i ø 2 ••• o m ]' (2.5) 

where o { is the output signal of the i’th neuron. The domains of vectors o are 
defined in m-dimensional space as follows for / = 1, 2, . . . , m (Hecht-Nielsen 
1990): 

(-l,ir = {oe/r, Qi €e(-U)} (2.6a) 

or 

(0, l) m = {o e R n \ o t e (0, 1)} (2.6b) 

for bipolar and unipolar continuous activation functions defined as in (2.3a) and 
(2.4a), respectively. It is evident that the domain of vector o is, in this case, the 
interior of either the m-dimensional cube (— 1, l) m or of the cube (0, Y) m . 
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Weights 

(b) 


Figure 2.7 Common models of neurons with synaptic connections: (a) hard-limiting neuron 
(binary perceptron) and (b) soft-limiting neuron (continuous perceptron). 


For binary-valued outputs o h the domains of o in m-dimensional space for 
i = 1, 2, . . . , m, are 

Hlf^oer, 0i e {-1,1}} (2.7a) 

or 

{O f ir-{o 6 *" f o, g {0,1}} (2.7b) 

for bipolar and unipolar binary activation functions defined as in (2.3b) and 
(2.4b), respectively. The domains of vector o are in this case vertices of either 

the m-dimensional cube [— 1,1 ] m or of the cube [0,1]™. Each vector o having 

binary-valued entries can thus assume 2 m different values. Input vectors x and 
their domains can be described similarly. 

Artificial neural systems using the models defined by (2.1) through (2.4) do 
not involve the biological neuron features of delay, refractory period, or discrete- 
time operation. In faet, the neuron models listed so far in this section represent 
instantaneous, memoryless networks; i.e. they generate the output response de- 
termined only by the present excitation. A delay feature can be added to the 
instantaneous neuron model by adding an extemal delay element to make the 
ensemble of neurons operate with memory, as shown in the next section. 
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2.2 i— 

MODELS OF ARTIFICIAL 
NEURAL NETWORKS 

Our introductory definition of artificial neural networks as physical cellular 
networks that are able to acquire, store, and utilize experiential knowledge has 
been related to the network’s capabilities and performance. At this point, know- 
ing the definition of the artificial neural network neuron model, we may benefit 
from another definition. The neural network can also be defined as an intercon- 
nection of neurons, as defined in (2.1) through (2.4), such that neuron outputs 
are connected, through weights, to all other neurons including themselves; both 
lag-free and delay connections are allowed. As research efforts continue, new 
and extended definitions may be developed, but this definition is sufficient for 
the introductory study of artificial neural architectures and algorithms found in 
this text. 


Feedforward Network 


Let us consider an elementary feedforward architecture of m neurons re- 
ceiving n inputs as shown in Figure 2.8(a). Its output and input vectors are, 
respectively 


0 =[ 0 ! 0 2 ••• O m ]‘ 

x =[*l *2 ••• X n ]‘ 


( 2 . 8 ) 


Weight w t j connects the /’th neuron with the fth input. The double subscript con- 
vention used for weights in this book is such that the first and second subscript 
denote the index of the destination and source nodes, respectively. We thus can 
write the activation value for the /’th neuron as 

n 

net i = X WijXj, for / = 1, 2, . . . , m (2.9) 

j = i 


The folio wing nonlinear transformation [Equation (2.10)] involving the activa- 
tion fxmction f(netj), for i = 1, 2, m, completes the processing of x. The 
transformation, performed by each of the m neurons in the network, is a strongly 
nonlinear mapping expressed as 


Oj =/( w'x), for I = 1, 2 m (2.10) 

where weight vector w,- contains weights leading toward the /’th output node and 
is defined as follows 

w, = [w,i w a w in }‘ (2.11) 

Introducing the nonlinear matrix operator T, the mapping of input space x to 
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Figure 2.8 Single-layer feedforward network: (a) interconnection scheme and (b) block diagram. 


output space o implemented by the network can be expressed as follows 

o = T[Wx] (2.12a) 


where W is the weight matrix, also called the connection matrix: 


and 


w n w 12 •• 

• w ln~ 

w 2l w 22 •• 

■ w 2n 

w ml w m2 ■ ■ 

■ Wmn. 

'/(•) 0 • 

•• 0 ' 

0 /(•) • 

•• 0 

0 0 - 

••/(•). 


(2.12b) 


(2.12c) 


Note that the nonlinear activation functions /(•) on the diagonal of the matrix 
operator 1 operate componentwise on the activation values net of each neuron. 
Each activation value is, in tum, a scalar product of an input with the respective 
weight vector. 

The input and output vectors x and o are often called input and output 
patterns, respectively. The mapping of an input pattem into an output pattem as 
shown in (2.12) is of the feedforward and instantaneous type, since it involves no 
time delay between the input x, and the output o. Thus, we can rewrite (2.12a) 
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in the explicit form involving time t as 

o(r) = r[Wx(t)] (2.13) 

Figure 2.8(b) shows the block diagram of the feedforward network. As can 
be seen, the generic feedforward network is characterized by the lack of feedback. 
This type of network can be connected in Cascade to create a multilayer network. 
In such a network, the output of a layer is the input to the following layer. Even 
though the feedforward network has no explicit feedback connection when x(t) 
is mapped into o (0, the output values are often compared with the “teacher’s” 
information, which provides the desired output value, and also an error signal 
can be employed for adapting the network’s weights. We will postpone more de- 
tailed coverage of feedforward network leaming through adaptation to Sections 
2.4 and 2.5. 


EXAMPLE 2.1 

This example presents the analysis of a two-layer feedforward network using 
neurons having the bipolar binary activation function given in (2.3b). The 
network to be analyzed Ts showh lri Figure 2.9(a). t!hir puipose is to find 
output o 5 for a given network and input pattem. Each of the network layers 
is described by the formula (2.12a) 

o = T[Wx] (2.14) 

By inspection of the network diagram, we obtain the output, input vectors, 
and the weight matrix for the first layer, respectively, as 

o = Oi o 2 o 3 o 4 ]* 

X = [xj x 2 -1]' 

1 0 1 

-1 0 -2 

0 1 o 

0 -1 -3 

Similarly, for the second layer we can write 
o = [o 5 ] 

x=[oj o 2 o 3 o 4 -1]' 

W 2 =[l 1 1 1 3.5] 

The response of the first layer can be computed for bipolar binary activation 
functions as below 

o = [ sgn (xj - 1) sgn(-X!+2) sgn(x 2 ) sgn(-x 2 + 3)]' 
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Neuron #1 



(b) 


Figure 2.9a, b Example of two-iayer feedforward network: (a) diagram and (b) two-dimensional 
space mapping (discrete activation function). 

Let us discuss the mapping performed by the first layer. Note that each 
of the neurons 1 through 4 divides the plane x l9 x 2 into two half-planes. 
The half-planes where the neurons’ responses are positive (+1) have been 
marked on Figure 2.9(b) with arrows pointing toward the positive response 
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(c) 

Figure 2.9c Example of two-layer feedforward network (continued): (c) two-dimensional space 
mapping (continuous activation function, A = 2.5). 



half-plane. The response of the second layer can be easily obtained as 
o 5 = sgn(6>j + o 2 + o 3 + o 4 — 3.5) 

Note that the fifth neuron responds + 1 if and only if o x = o 2 = o 3 = o 4 = 1. 
It therefore selects the intersection of four half-planes produced by the first 
layer and designated by the arrows. Figure 2.9(b) shows that the network 
maps the shaded region of plane x x , x 2 into o 5 = 1, and it maps its com- 
plement into o 5 = — 1. In summary, the network of Figure2.9(a) provides 
mapping of the entire x x , x 2 plane into one of the two points ± 1 on the real 
number axis. 

Let us look at the mapping provided by the same architecture but 
with neurons having sigmoidal characteristics. For the continuous bipolar 
activation function given in (2.3a), we obtain for the first layer 


1 + exp(l — Xj)A 


1 + exp (jtj - 2)A 


1 4- exp(-jc 2 )A 


L 1 + exp ( x 2 - 3)A 
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and for the second layer 

o 5 = l -1 

1 + exp(3.5 — o x — o 2 ~ o 3 - o 4 ) A 

The network with neurons having sigmoidal activation functions performs 
mapping as shown in Figure 2.9(c). The figure reveals that although some 
similarity exists with the discrete neuron case, the mapping is much more 
complex. The example shows how the two-dimensional Space has been 
mapped into the segment of one-dimensional space. In summary, the network 
of Figure 2.9(a) with bipolar continuous neurons provides mapping of the 
entire x 1? x 2 plane into the interval (-1,1) on the real number axis. Similar 
mapping was shown earlier in Figure 1.3. In faet, neural networks with as 
few as two layers are capable of universal approximation from one finite 
dimensional space to another. mm 


Feedback Network 

A feedback network can be obtained from the feedforward network shown 
in Figure 2.8(a) by connecting the neurons’ outputs to their inputs. The result is 
depicted in Figure 2.10(a). The essence of closing the feedback loop is to enable 



(a) 


73 


Instantaneous 

network 

rtwo(f)] 


> o(r+A) 


o(0 


Delay 


(b) 


Figure 2.10 Single-layer discrete-time feedback network: (a) interconnection scheme and 
(b) block diagram. 
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control of output through outputs Oj, for j = 1, 2, . . . , m. Such control is 
especially meaningful if the present output, say o(t), Controls the output at the 
folio wing instant, o (t + A). The time A elapsed between t and t + A is introduced 
by the delay elements in the feedback loop as shown in Figure 2.10(a). Here the 
time delay A has a symbolic meaning; it is an analogy to the refractory period 
of an elementary biological neuron model. Using the notation introduced for 
feedforward networks, the mapping of o (t) into o (7 + A) can now be written as 

o(f + A) = TtWo(0] (2.15) 


This formula is represented by the block diagram shown in Figure 2.10(b). Note 
that the input x(7) is only needed to initialize this network so that o(0) = x(0). 
The input is then removed and the system remains autonomous for t > 0. We 
thus consider here a special case of this feedback configuration, such that x(t) — 
jt(0) and no input is provided to the network thereafter, or for t > 0. 

There are two main categories of single-layer feedback networks. If we con- 
sider time as a discrete variable and decide to observe the network performance 
at discrete time instants A, 2A, 3A, . . . , the system is called discrete-time. For 
notational convenience, the time step in discrete-time networks is equated to 
unity, and the time instances are indexed by positive integers. Symbol A thus 
has the meaning of unity delay. The choice of indices as natural numbers is 
convenient since we initialize the study of the system at t — 0 and are interested 
in its response thereafter. For a discrete-time artificial neural system, we have 
converted (2.15) to the form 

o* +1 - r[Wo*], for k = 1, 2, ... (2.16a) 

where k is the instant number. The network in Figure 2.10 is called recurrent 
since its response at the k + 1 ’th instant depends on the entire history of the 
network starting at k = 0. Indeed, we have from (2.16a) a series of nested 
solutions as follows 

0 1 = r[Wx°] 

0 2 = r [wnwx 0 ]] 

( 2 . 16 b) 


n.*+l 


= rlwr 


L.riwx 0 ]... 


Recurrent networks typically operate with a discrete representation of data; they 
employ neurons with a hard-limiting activation function. A system with discrete- 
time inputs and a discrete data representation is called an automaton. Thus, 
recurrent neural networks of this class can be considered as automatons. 

Equations (2.16) describe what we call the State o k of the network at instants 
k — 1,2,..., and they yield the sequence of State transitions. The network begins 
the State transitions once it is initialized at instant 0 with x°, and it goes through 
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State transitions o*, for k = 1,2,..., until it possibly finds an equilibrium State. 
This equilibrium State is often called an attractor. An attractor can consist of a 
single State or a limited number of States. The example network of Figure 1.5(a) 
can be considered as having a single-state attractor of value o = [ 1 1 — 1 ]*. 

The sequence of States of a recurrent network is generally nondeterministic. In 
addition, there are often many equilibrium States that can potentially be reached 
by the network after a number of such nondeterministic transitions. These issues 
will be covered in more detail in Chapters 5 and 6. 


EXAMPLE 2.2 

This example covers the basic concept of State transitions and the analysis 
of a simple recurrent discrete-time network. Figure 2.11 shows a four- 
dimensional hypercube, the vertices of which represent four-dimensional 
bipolar binary vectors, or simply, binary numbers. The cube shown visual- 
izes State vector values and possible transitions for the four-neuron network. 
Four edges terminate at each vertex, since vectors that are different by a 
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single bit component are connected with an edge. Thus, binary numbers 
joined by any single line can be considered to differ by only one digit. 
The vertices can be thought of as the output vector domain of a certain 
four-neuron discrete-time network with bipolar binary neurons. 

Assume now that we know how to enforce the transition of States as 
marked by the arrows. We also can notice that each event of State transition 
ends at one of the two vectors as follows 


O, = [ 1 1 1 -1]' 

o 2 = [-1 -1 -1 1 ]' 


(2.17) 


which are apparently equilibria. Below is an example showing that a network 
can be devised that indeed has the equilibria as in (2.17). 

Let us analyze an example of the discrete-time recurrent network shown 
in Figure 2.12. Its weight matrix can be set up by inspection as 


W = 


0 1 
1 o 
1 1 
-1 -1 


i -r 

i -i 
o -i 
-i o 


(2.18) 


First, it is easy to notice that none of the equilibrium States of (2.17) causes 
any further transitions in the network. Indeed, substituting o x = x° in (2.16b) 
and using (2.18) results in the first recurrence 


sgn(-) 

0 

0 

0 


net® 

0 

sgn (•) 

0 

0 


net® 

0 

0 

sgn(-) 

0 


net® 

0 

0 

0 

sgn(-)_ 


net® 


(2.19a) 


or 


o 1 = [ sgn (3) sgn (3) sgn(3) sgn (-3)]' (2.19b) 

and thus no transitions take place thereafter since o 1 = o 2 = . . . = o x . The 
reader may easily verify that if o 2 = x° then o 1 = o 2 = . . . = o 2 , and no 
transitions take place either. 

Assume now that the network is initialized at a State x° = [ 1 1 1 1 ]', 
which is adjacent to o x . We may compute that the output o 1 becomes 


o 1 = [sgn(l) sgn(l) sgn(l) sgn (-3)]' 
Therefore, the transition that takes place is 


( 2 . 20 ) 


1 1 1 !]—♦[! 1 1 -1 


and the network reaches its closest equilibrium State. The reader may easily 
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Figure 2.12 Recurrent network for two-equilibrium State diagram of Figure 2.11: (a) diagram, 

(b) bit map of equilibrium State vector Oi , and (c) example bit maps that converge to the bit map 
of part (b). 


verify that for any of the initial States as below 

x°=[l 1 -1 -l]',x°=[l -1 1 -l]',or 

x° = [-1 1 1 -1]' 
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there will be subsequent transitions such that o 1 = Oj and the network will 
always settie at Oj. 

Bipolar binary vectors can be visualized in a bit-map form so that black 
and white elements of a bit map correspond to 1 and —1, respectively. The 
bit map in Figure 2.12(b) would thus represent o l . Note that the four bit 
maps of Figure 2.12(c), if represented as vectors, can be regarded as the 
initial States. Transitions from these States end at the equilibrium point 
It is seen that the initial State converges in the network of Figure 2.12(a) 
to its closest equilibrium distant only by one bit. In cases when the initial 
State differs by two bits from both equilibria, the initializing input vector is 
equidistant to both Oj and o 2 . Thus, convergence to either of the equilibria 
would be justifiable. This assumes that the network should seek out the 
output which is the most similar to the initializing binary vector. Although 
we have not shown yet how to encode the desired equilibrium States and how 
to control transitions, the discussion in this example helps in understanding 
the concept of State transitions and the analysis of recurrent networks. h 


In the discussion of the feedback concept, a discrete time delay A has been 
used between the input and output. The insertion of a unity delay element in the 
network feedback loop has made it possible to outline the formal definition of 
a recurrent network operating at discrete instants k - 1,2, . . . , upon its initial- 
ization at k = 0. To generalize the ideas just discussed, note that the feedback 
concept can also be implemented with any infinitesimal delay between output 
and input introduced in the feedback loop. The consequence of an assumption of 
such delay between input and output is that the output vector can be considered 
to be a continuous-time function. As a result, the entire network operates in 
continuous time. It can be seen that a continuous-time network can be obtained 
by replacing delay elements in Figure 2.10 with suitable continuous-time lag 
producing components. 

An example of such an elementary delay network is shown in Figure 2.13(a). 
It is a simple electric network consisting of resistance and capacitance. In faet, 
electric networks are very often used to model the computation performed by 
neural networks. Electric networks possess the flexibility to model all linear and 
nonlinear phenomena encountered in our neural network studies. Because of this 
flexibility, they will often represent working physical models of neural networks. 
The differential equation relating v 2 and v 1# the output and input voltages, re- 
spectively, is 


( 2 . 21 ) 


— 2 + Il = ZL 
dt RC RC 

The change of output voltage Av 2 occurring within the small time interval A t 
can be approximately expressed from (2.21) as follows: 

{ 2 . 22 ) 
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Figure 2.13 Feedback connection in continuous-time neural network: (a) elementary connecting 
Circuit, (b) output response due to an impulse, and (c) output wave due to a damped harmonic 
input wave. 


From (2.22) we see that for fixed C and Af, increases of output voltage v 2 are 
proportional to the difference between v l and v 2 . If v x is kept constant or is 
varied slowly with respect to the time constant RC , v 2 approximately follows 
v x in time, with a small delay. Examples of input waveforms v x and output 
responses v 2 are shown in Figures 2.13(b) and (c). Figure 2.13(b) shows v 2 (f) 
resulting from Vj(f) being an impulse excitation. Figure 2.13(c) shows v 2 (f) due 
to a damped harmonic input wave excitation. Although such a wave is not very 
likely to be observed in artificial neural systems, it shows the explicit delay of 
v 2 where v 2 lags Vj by approximately RC. 

Usually, continuous-time networks employ neurons with continuous activa- 
tion functions. An elementary synaptic connection using the delay network given 
in Figure 2.13(a) is shown in Figure 2.14. The resistance R t j serves as a weight 
from the output of the / th neuron to the input of the V th neuron. Using the finite 
time interval Af, Equation (2.21) can be discretized as 
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Figure 2.14 Elementary synaptic connection within continuous-time network. 


The activation of the V th neuron at the instant k + 1 can be expressed as 

netf +l s netf + (of - netf) (2.23b) 

As can be seen, the contribution to net { by the f th neuron is distributed in 
time according to (2.23b), where A t denotes an infinitesimal time step. When n 
neurons are connected to the input of the fth neuron as shown in Figure 2.14, 
expression (2.23b) needs to be computed for 7 = 1,2 ,...,« and summed. Numer- 
ous studies indicate that for n neurons interconnected as shown, fairly complex 
dynamic behavior of the network is usually obtained. The accurate description of 
transitions, which have been very simple for the recurrent discrete-time networks 
as demonstrated in Example 2.2, requires solving nonlinear differential equations 
for continuous-time networks. A more detailed discussion of properties of this 
class of networks is presented in Chapter 5. The introductory example below 
illustrates the main features and time-domain performance of continuous-time 
feedback networks. 


EXAMPLE 2.3 


Let us consider an electric continuous-time network of two neurons as shown 
in Figure2.15(a). The matrix form description, such as (2.12) for feedforward 
networks and (2.15) for discrete-time feedback networks, will be discussed 
in Chapter 5. In this example, we obtain the network equations by inspection 
of its diagram. We also discuss the network’s main properties in terms of 
its dynamical behavior. 
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Simulation Results 


Step 

Run 1 

Run 2 

Run 3 

Run 4 

°i 

92 

°i 

92 

Øl 

°2 

01 

02 

0 

0.245 

—0.124 

0.462 

-0.245 

0.555 

-0.555 

0.704 

-0.848 

25 

0.264 

-0.146 

0.493 

-0.284 

0.598 

-0.598 

0.749 

-0.875 

50 

0.284 

-0.170 

0.526 

-0.326 

0.642 

-0.642 

0.790 

-0.899 

75 

0.307 

-0.195 

0.560 

-0.370 

0.685 

-0.685 

0.827 

-0.919 

100 

0.332 

-0.224 

0.596 

-0.416 

0.727 

-0.727 

0.860 

-0.937 

125 

0.360 

-0.254 

0.633 

-0.464 

0.767 

-0.767 

0.888 

-0.951 

150 

0.389 

-0.287 

0.671 

-0.513 

0.804 

-0.804 

0.912 

-0.963 

175 

0.421 

-0.323 

0.708 

-0.563 

0.838 

-0.838 

0.932 

-0.973 

200 

0.456 

-0.361 

0.745 

-0.612 

0.868 

-0.868 

0.948 

-0.980 

225 

0.492 

-0.402 

0.781 

-0.661 

0.894 

-0.894 

0.961 

-0.986 

250 

0.530 

-0.445 

0.814 

-0.707 

0.916 

-0.916 

0.971 

-0.990 

275 

0.570 

-0.489 

0.845 

-0.751 

0.935 

-0.935 

0.979 

-0.993 

300 

0.611 

-0.535 

0.872 

-0.791 

0.950 

-0.950 

0.985 

-0.995 

325 

0.652 

-0.582 

0.897 

-0.828 

0.963 

-0.963 

0.989 

-0.997 

350 

0.692 

-0.628 

0.918 

-0.860 

0.972 

-0.972 

0.993 

-0.998 

375 

0.732 

-0.674 

0.936 

-0.888 

0.980 

-0.980 

0.995 

-0.999 

400 

0.770 

-0.718 

0.951 

-0.912 

0.986 

-0.986 

0.997 

-0.999 

425 

0.806 

-0.760 

0.963 

-0.931 

0.990 

-0.990 

0.998 

-0.999 

450 

0.838 

-0.798 

0.973 

-0.948 

0.993 

-0.993 

0.999 

- 1.000 

475 

0.868 

-0.833 

0.980 

-0.960 

0.995 

-0.995 

0.999 

- 1.000 

500 

0.893 

-0.864 

0.986 

-0.971 

0.997 

-0.997 

0.999 

- 1.000 


(b) 


Figure 2.15a, b Continuous-time network from Example 2.3: (a) diagram, and (b) sample itera- 
tions for Oi , ty 
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Figure 2.15c Continuous-time network from Example 2.3 (continued): (c) illustration for equilib- 
rium search. 


The neuron outputs are connected to their inputs using a simple RC 
delay network, as in Figure 2.13(a). In addition, resistances R between each 
neuron input and ground are added to account for the neurons’ finite in- 
put resistances. Such resistances model nonzero input currents absorbed by 
actual neurons. The differential equations obtained by summing currents in 
input nodes of the network being input neuron nodes are 


n d net x _ o 2 — net x net x 

1_ dr /? 12 ~ ~R~ 

^ d net 2 _ o y — net 2 net 2 

2 ~di ^ ~~ ~R~ 


(2.24) 


where o ( — f(netj), i = 1, 2, and f (net) is assumed to be given by (2.3a). 
We will also assume that C x = C 2 and R X1 = R 2 \ < 0 for this network. 
Although this version of the Circuit involves negative resistances, we will 
show in Chapter 5 that they will not be needed to build actual networks. 
Before an analysis of this network is pursued, let us try to approach the 
problem intuitively. 

Assume that this circuit has been initialized by storing a charge at the 
input capacitances, and it is allowed thereafter to seek its own equilibrium. 
Our initial assumption is that starting with observations of the network at 
t = 0, the value o x has slightly increased. This would, after a small delay A t 
introduced by the RC network, lower the value of net 2 since R 2X is negative. 
Since f(net 2 ) is a monotonically increasing function, o 2 would also decrease 
within the (0, A t) interval. This in tum would increase net x and, as a final 
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consequence, the original initial increase of o x would be enhanced due to the 
feedback interaction described. We have the cause-and-effect relationship, 
where the effect increases the initial cause in a closed, positive feedback, 
loop. If this interaction is allowed to continue for some time, the increases of 
o x will eventually slow down, since f(net) hattens out far from the origin, 
when o x — > 1”. Noticeably, as o x increases, o 2 decreases symmetrically 
toward — 1 + (a + , a~ are “right” and “left” neighborhoods, respectively, of 
the point a while the point a is excluded). This discussion indicates that the 
network of Figure 2.15(a) for high-gain neurons would seek its equilibrium, 
°r [°\ o 2 ]-+[r -1 + ], 

The correct conclusion, however, is not unique. If the initial hypothesis 
on the increase of o x is negated, the conclusion from a similar discussion as 
above is just the opposite and the network seeks its equilibrium such that 
[°i °i] [~ 1~ ] • Obviously, the network cannot simultaneously 

be seeking two different solutions and move along opposite paths. Either 
the first or the second hypothesis must be true, but it is somewhat difficult 
to pick the right one at this point. 

To gain better insight into the dynamics of the network, the equations 
(2.24) must be solved numerically since the closed-form solutions do not 
exist. Discretizing the differential equation (2.24) similar to (2.23), we obtain 


neft x x = net x + fo k — netf] 

R\2^\ ^ 

nel!i +x = net- + (o\ — net-} 

2 2 Ri\C 2 \ 1 2 > 


rv . 

—~At 

RC X 

neti A 
-—^At 
RC 2 


(2.25) 


Notice that the numerical integration method chosen in (2.25) (forward Eu- 
ler) is simple but usually not recommended for more sophisticated numerical 
computation. However, it is adequate for the purpose of this example and 
is also consistent with formulas (2.23). Sample results of simulations are 
partially listed in Figure 2.15(b) linking o x and o 2 as a function of k. Con- 
ditions chosen for the simulation are A = 2.5, dt = 0.002, R = 10, and 
R n C x = R 2X C 2 — —1. Initial conditions are listed for k = 0 in the first 
row of the table for each of the four simulations shown. All of the partially 
listed runs converge to the solution [o x o 2 ] = [ 1“ — 1 + ]. 

The schematic illustration for more general cases of the equilibrium 
search is shown in Figure 2.15(c) for — 1 < o x ^ 1 and — 1 < o 2 < 1. 
The trajectories indicate the computed solution functions o 2 (o x ) with time 
being a parameter. The line o 2 = o x separates the two regions of attractions. 
For initial condition o° x > o®, [ 1“ ~1 + ] is an equilibrium solution, other- 

wise the solution is at [ — 1 + 1” ] . Note that this network maps the entire 

two-dimensional input space into two points, which are roughly expressed 
as O! = \fl /— 45 ° and o 2 = a/2 /135 °. In contrast to feedforward net- 
works, as shown in Figure 2.9(a), feedback networks require time to provide 
response to the mapping problem. m 
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2.3 

NEURAL PROCESSING 

The material of the previous section has focused mainly on the computation 
of response o for a given input x for several important classes of neural networks. 
The analysis of three example networks shown in Figures 2.9(a), 2.12(a), and 
2.15(a) has also been discussed. The process of computation of o for a given x 
performed by the network is known as recall. Recall is the proper processing 
phase for a neural network, and its objective is to retrieve the information. Recall 
corresponds to the decoding of the stored content which may have been encoded 
in a network previously. In this section we are mainly concemed with the recall 
mode and the tasks it can accomplish. Based on the preliminary observations 
made so far, we now outline the basic forms of neural information processing. 

Assume that a set of pattems can be stored in the network. Later, if the 
network is presented with a pattem similar to a member of the stored set, it 
may associate the input with the closest stored pattem. The process is called 
autoassociation. Typically, a degraded input pattem serves as a cue for retrieval 
of its original form. This is illustrated schematically in Figure 2.16(a). The figure 
shows a distorted square recalling the square encoded. Another example of auto- 
association is provided in the Figure 1.8. 

Associations of input pattems can also be stored in a heteroassociation vari- 
ant. In heteroassociative processing, the associations between pairs of pattems 
are stored. This is schematically shown in Figure 2.16(b). A square input pattem 
presented at the input results in the rhomboid at the output. It can be inferred 
that the rhomboid and square constitute one pair of stored pattems. A distorted 
input pattem may also cause correct heteroassociation at the output as shown 
with dashed line. 

Classification is another form of neural computation. Let us assume that 
a set of input pattems is divided into a number of classes, or categories. In 
response to an input pattem from the set, the classifier is supposed to recall the 
information regarding class membership of the input pattem. Typically, classes are 
expressed by discrete-valued output vectors, and thus output neurons of classifiers 
would employ binary activation functions. The schematic diagram illustrating the 
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Figure 2.16 Association response: (a) autoassociation and (b) heteroassociation. 



54 


FUNDAMENTAL CONCEPTS AND MODELS OF ARTIFICIAL NEURAL SYSTEMS 


Input 


Class 


pattem 


number 

A > 

{AXZ7} 

'> 

-l' 

i 


/ 

i 


(a) 


Input 


Class 


pattem 


number 

X=} 

{AXZ7} 


1 

-1 


(b) 


Figure 2.17 Classification response: (a) classification and (b) recognition. 


classification response for pattems belonging to three classes is shown in Figure 
2.17(a). 

Interestingly, classification can be understood as a special case of heteroasso- 
ciation. The association is now between the input pattem and the second member 
of the heteroassociative pair, which is supposed to indicate the input’s class num- 
ber. If the network’s desired response is the class number but the input pattem 
does not exactly correspond to any of the pattems in the set, the processing is 
called recognition . When a class membership for one of the pattems in the set 
is recalled, recognition becomes identical to classification. Recognition within 
the set of three pattems is schematically shown in Figure 2.17(b). This form of 
processing is of particular significance when an amount of noise is superimposed 
on input pattems. 

One of the distinet strengths of neural networks is their ability to general- 
ize. The network is said to generalize well when it sensibly interpolates input 
pattems that are new to the network. Assume that a network has been trained 
using the data x x through jc 5 , as shown in Figure 2.18. The figure illustrates bad 
and good generalization examples at points that are new and are between the 
training points. Neural networks provide, in many cases, input-output mappings 
with good generalization capability. 



Figure 2.18 Generalization example. 
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When reviewing the different aspects of neural processing in this section, 
the assumption has been made that neural networks are able to store data. So far, 
we discussed the retrieval of network responses without covering any possible 
methods of data storage. Data are stored in a network as a result of its leaming. 
The following section introduces basic concepts of leaming. 


2.4 i 

LEARNING AND ADAPTATION 

Each of us acquires and then hones our skilis and abilities through the basic 
phenomenon of leaming. Leaming is a fundamental subject for psychologists, but 
it also underlies many of the topics in this book. In general, leaming is a relatively 
permanent change in behavior brought about by experience. Leaming in human 
beings and animals is an inferred process; we cannot see it happening directly 
and we can assume that it has occurred by observing changes in performance. 
Leaming in neural networks is a more direct process, and we typically can capture 
each leaming step in a distinet cause-effect relationship. To perform any of the 
processing tasks discussed in the previous section, neural network leaming of an 
input-output mapping from a set of examples is needed. Designing an associator 
or a classifier can be based on leaming a relationship that transforms inputs into 
outputs given a set of examples of input-output pairs. A classical framework for 
this problem is pro vided by approximation theory (Poggio and Girosi 1990). 


Leaming as Approximation or 
Equilibria Encoding 

Approximation theory focuses on approximating a continuous, multivariable 
funetion h(\) by another funetion H( w,x), where x = [x { x 2 ••• is 

the input vector and w = [ w x w 2 ••• w m ] 1 is a parameter (weight) vector. 
In the approach below, we will look at a class of neural networks as systems 
that can leam approximation of relationships. The leaming task is to find w that 
provides the best possible approximation of h(\) based on the set of training 
examples {x}. An important choice that needs to be made is which approxima- 
tion funetion H( w, x) to use. An ill-chosen, nonsmooth, approximation funetion 
example is shown in Figure 2.18 as curve (2). Even with the best choice of 
parameters for an ill-chosen funetion, the approximation is inaccurate between 
successive data points. The choice of funetion H{ w,x) in order to represent h(\) 
is called a representation problem. Once H( w, x) has been chosen, the network 
leaming algorithm is applied for finding optimal parameters w. A more precise 
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formulation of the leaming problem can be stated as calculation involving w* 
such that (Poggio and Girosi 1990): 

p [H( w*, x), h(x)] < p [H( w, x), h(x)] (2.26) 

where p[//(w,x), h(x)], or distance function, is a measure of approximation qual- 
ity between H{ w,x) and h(x). When the fit is judged according to the sum of 
squared differences taken for the set of training examples {x}, the distance has a 
form of sum of squared errors. As will be shown in Chapters 3 and 4, the feed- 
forward networks, both single-layer and multilayer, can be taught to perform the 
desired mappings as described. In this chapter, only the main leaming principles 
will be introduced. 

In contrast to feedforward networks, which store mapping that can be re- 
called instantaneously, feedback networks are dynamical systems. The mappings 
in feedback networks are encoded in the equilibrium States. Similar to approx- 
imation leaming, weights also determine the properties of feedback networks. 
Leaming in feedback networks corresponds to equilibria encoding. Usually this 
is accomplished by a so-called “recording process,” but stepwise leaming ap- 
proaches have also been developed for this class of networks. Equilibrium States 
leaming will be discussed in more detail in Chapters 5 and 6. 


Supervised and Unsupervised Learning 

Under the notion of leaming in a network, we will consider a process of 
forcing a network to yield a particular response to a specific input. A particular 
response may or may not be specified to provide extemal correction. Leaming is 
necessary when the information about inputs / outputs is unknown or incomplete a 
priori , so that no design of a network can be performed in advance. The majority 
of the neural networks covered in this text requires training in a supervised or 
unsupervised leaming mode. Some of the networks, however, can be designed 
without incremental training. They are designed by batch learning rather than 
stepwise training. 

Batch learning takes place when the network weights are adjusted in a 
single training step. In this mode of leaming, the complete set of input /output 
training data is needed to determine weights, and feedback information produced 
by the network itself is not involved in developing the network. This leaming 
technique is also called recording. Leaming with feedback either from the teacher 
or from the environment rather than a teacher, however, is more typical for neural 
networks. Such leaming is called incremental and is usually performed in steps. 

The concept of feedback plays a central role in leaming. The concept is 
highly elusive and somewhat paradoxical. In a broad sense it can be understood 
as an introduction of a pattem of relationships into the cause-and-effect path. We 
will distinguish two different types of leaming: leaming with supervision versus 
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(a) 


(b) 


Figure 2.19 Block diagram for explanation of basic learning modes: (a) supervised learning and 
(b) unsupervised learning. 


learning without supervision. The learning types block diagrams are illustrated 
in Figure 2.19. 

In supervised learning we assume that at each instant of time when the 
input is applied, the desired response d of the system is provided by the teacher. 
This is illustrated in Figure 2.19(a). The distance p[d,o] between the actual and 
the desired response serves as an error measure and is used to correct network 
parameters extemally. Since we assume adjustable weights, the teacher may im- 
plement a reward-and-punishment scheme to adapt the network’s weight matrix 
W. For instance, in learning classifications of input pattems or situations with 
known responses, the error can be used to modify weights so that the error 
decreases. This mode of learning is very pervasive. Also, it is used in many 
situations of natural learning. A set of input and output pattems called a training 
set is required for this learning mode. 

Typically, supervised learning rewards accurate classifications or associations 
and punishes those which yield inaccurate responses. The teacher estimates the 
negative error gradient direction and reduces the error accordingly. In many situ- 
ations, the inputs, outputs and the computed gradient are deterministic, however, 
the minimization of error proceeds over all its random realizations. As a result, 
most supervised learning algorithms reduce to stochastic minimization of error 
in multi-dimensional weight space. 

Figure 2.19(b) shows the block diagram of unsupervised learning. In learn- 
ing without supervision , the desired response is not known; thus, explicit error 
information cannot be used to improve network behavior. Since no information is 
available as to correctness or incorrectness of responses, learning must somehow 
be accomplished based on observations of responses to inputs that we have mar- 
ginal or no knowledge about. For example, unsupervised learning can easily result 
in finding the boundary between classes of input pattems distributed as shown 
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Figure 2.20 Two-dimensional patterns: (a) clustered and (b) no apparent clusters. 


in Figure 2.20. In a favorable case, as in Figure 2.20(a), cluster boundaries can 
be found based on the large and representative sample of inputs. Suitable weight 
self-adaptation mechanisms have to be embedded in the trained network, because 
no extemal instructions regarding potential clusters are available. One possible 
network adaptation rule is: A pattem added to the cluster has to be doser to the 
center of the cluster than to the center of any other cluster. 

Unsupervised leaming algorithms use pattems that are typically redundant 
raw data having no labels regarding their class membership, or associations. In 
this mode of leaming, the network must discover for itself any possibly exist- 
ing pattems, regularities, separating properties, etc. While discovering these, the 
network undergoes change of its parameters, which is called self-organization. 

The technique of unsupervised leaming is often used to perform clustering 
as the unsupervised classification of objects without providing information about 
the actual classes. This kind of leaming corresponds to minimal a priori infor- 
mation available. Some information about the number of clusters, or similarity 
versus dissimilarity of pattems, can be helpful for this mode of leaming. Finally, 
leaming is often not possible in an unsupervised environment, as would probably 
be true in the case illustrated in Figure 2.20(b) showing pattem classes not easily 
discemible even for a human. 

Unsupervised leaming is sometimes called leaming without a teacher. This 
terminology is not the most appropriate, because leaming without a teacher is 
not possible at all. Although, the teacher does not have to be involved in every 
training step, he has to set goals even in an unsupervised leaming mode (Tsypkin 
1973). We may think of the folio wing analogy. Leaming with supervision corre- 
sponds to classroom leaming with the teacher ’s questions answered by students 
and corrected, if needed, by the teacher. Leaming without supervision corresponds 
to leaming the subject from a videotape leeture covering the material but not 
including any other teacher ’s involvement. The teacher leetures directions and 
methods, but is not available. Therefore, the student cannot get explanations of 
unclear questions, check answers and become fully informed. 
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2.5 — — 

NEURAL NETWORK 
LEARNING RULES 

Our focus in this section will be artificial neural network leaming rules. 
A neuron is considered to be an adaptive element. Its weights are modifiable 
depending on the input signal it receives, its output value, and the associated 
teacher response. In some cases the teacher signal is not available and no error 
information can be used, thus the neuron will modify its weights based only on 
the input and /or output. This is the case for unsupervised leaming. 

Let us study the leaming of the weight vector w,-, or its components w t j 
connecting the fth input with the fth neuron. The trained network is shown 
in Figure 2.21 and uses the neuron symbol from Figure 2.4. In general, the 
fth input can be an output of another neuron or it can be an extemal input. 
Our discussion in this section will cover single-neuron and single-layer network 
supervised leaming and simple cases of unsupervised leaming. Under different 
leaming mies, the form of the neuron ’s activation function may be different. Note 
that the threshold parameter may be included in leaming as one of the weights. 
This would require fixing one of the inputs, say x n . We will assume here that 
x n , if fixed, takes the value of -1. 

The following general leaming rule is adopted in neural network studies 
(Amari 1990): The weight vector w, = [w n w i2 • • - w in ] 1 increases in pro- 
portion to the product of input x and leaming signal r. The leaming signal r is 
in general a function of w,,x, and sometimes of the teacher ’s signal d { . We thus 
have for the network shown in Figure 2.21: 

r = r(WpX,^) (2.27) 


The increment of the weight vector w, produced by the leaming step at time t 
according to the general leaming mle is 

Aw,(0 = cr [w ( (0, x(0, dj(t)} x(?) (2.28) 

where c is a positive number called the learning constant that determines the 
rate of leaming. The weight vector adapted at time t becomes at the next instant, 
or leaming step, 

W i(t + 1) = w,(0 + cr [w,(0,x(0,4(0] x(0 (2.29a) 

The superscript convention will be used in this text to index the discrete-time 
training steps as in Eq. (2.29a). For the fc’th step we thus have from (2.29a) 
using this convention 

w* +1 = w* + cr(vf k y x k ,d k )x k (2.29b) 

The leaming in (2.29) assumes the form of a sequence of discrete-time weight 
modifications. Continuous-time leaming can be expressed as 


dw,(Q 
d t 


= crx(t ) 


(2.30) 
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Figure 2.21 Illustration for weight leaming rules (of/ provided only for supervised learning 
mode). 


Discrete-time, or stepwise, leaming is reviewed below. Weights are assumed to 
have been suitably initialized before each leaming experiment started. 


Hebbian Learning Rule 

For the Hebbian leaming mle the leaming signal is equal simply to the 
neuron’s output (Hebb 1949). We have 

X r=/(w'x) (2.31) 

The increment Aw,- of the weight vector becomes 

y Aw i = c/(w'x)x (2.32a) 

The single weight is adapted using the following increment: 

Y A Wjj = cf(Wp)Xj (2.32b) 

This can be written briefly as 

A = co t Xj , for j ~ 1 , 2 ,...,« (2.32c) 

This leaming mle requires the weight initialization at small random values 
around w, = 0 prior to leaming. The Hebbian leaming mle represents a purely 
feedforward, unsupervised leaming. The mle implements the interpretation of the 
classic statement: “When an axon of cell A is near enough to excite a cell B 

and repeatedly or persistently takes place in firing it, some growth process or 
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metabolic change takes place in one or both cells such that A’s efficiency, as one 
of the cells firing B, is increased.” (Hebb 1949.) 

The rule States that if the crossproduct of output and input, or correlation 
term o { Xj is positive, this results in an increase of weight w t j\ otherwise the 
weight decreases. It can be seen that the output is strengthened in tum for each 
input presented. Therefore, frequent input pattems will have most influence at 
the neuron’s weight vector and will eventually produce the largest output. 

Since its inception, the Hebbian rule has evolved in a number of directions. 
In some cases, the Hebbian rule needs to be modified to counteract unconstrained 
growth of weight values, which takes place when excitations and responses 
consistently agree in sign. This corresponds to the Hebbian leaming rule with 
saturation of the weights at a certain, preset level. Throughout this text note 
that other leaming mies often reflect the Hebbian rule principle. Below, most of 
the leaming mies are illustrated with simple numerical examples. Note that the 
subscript of the weight vector is not used in the examples since there is only a 
single weight vector being adapted there. 


EXAMPLE 2.4 


This example illustrates Hebbian leaming with binary and continuous acti- 
vation functions of a very simple network. Assume the network shown in 
Figure 2.22 with the initial weight vector 

r i i 


0.5 


needs to be trained using the set of three input vectors as below 



1 

*-h <N 
1 

t 


" 1 ' 
-0.5 


1 

0 

1 

II 

1.5 

o 

X 

KJ 

II 

-2 

-1.5 

> x 3 = 

1 

_ 1 
l/l 

1 


for an arbitrary choice of leaming constant c = 1. Since the initial weights 
are of nonzero value, the network has apparently been trained before- 
hand. Assume first that bipolar binary neurons are used, and thus f(net) = 
sgn (net). 

Step 1 Input Xj applied to the network results in activation net 1 as below: 

r i i 



o 


net 1 = w 1( x, = [l -1 0 0.5] 
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The updated weights are 

w 4 = w 3 + sgn (net 3 )x 3 = w 3 — x 3 


1 

-3.5 

4.5 

0.5 


It can be seen that leaming with discrete f(net) and c — 1 results in 
adding or subtracting the entire input pattem vectors to and from the weight 
vector, respectively. In the case of a continuous f(net ), the weight incre- 
menting/decrementing vector is scaled down to a fractional value of the 
input pattem. 

Revisiting the Hebbian leaming example, with continuous bipolar ac- 
tivation function f(net ), using input x 1 and initial weights w 1 , we obtain 
neuron output values and the updated weights for A = 1 as summarized in 
Step 1. The only difference compared with the previous case is that instead 
of f (net) = sgn (net), now the neuron’s response is computed from (2.3a). 

Step 1 


f(net l ) = 0.905 


w 


2 


' 1.905' 
-2.81 
1.357 
0.5 


Subsequent training steps result in weight vector adjustment as below: 


Step 2 


/( net 2 ) = -0.077 


w 


1.828' 

-2.772 

1.512 

0.616 


Step 3 


/( net 3 ) = 


-0.932 

' 1.828' 
-3.70 
2.44 
-0.783 


Comparison of leaming using discrete and continuous activation func- 
tions indicates that the weight adjustments are tapered for continuous f(net) 
but are generally in the same direction. ■■ 
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Perceptron Learning Rule 

For the perceptron learning rule, the learning signal is the difference be- 
tween the desired and actual neuron’s response (Rosenblatt 1958). Thus, learning 
is supervised and the learning signal is equal to 

r = dj — o, (2.33) 

where o { = sgn (w'x), and d t is the desired response as shown in Figure 2.23. 
Weight adjustments in this method, Aw, and A are obtained as follows 

Aw i = c [d { - sgn (w|x)] x (2.34a) 

A Wy = c - sgn (w'x)] x p for j = 1, 2, . . . , n (2.34b) 

Note that this rule is applicable only for binary neuron response, and the relation- 
ships (2.34) express the rule for the bipolar binary case. Under this rule, weights 

are adjusted if and only if o ( is incorrect. Error as a necessary condition of 

learning is inherently included in this training rule. Obviously, since the desired 
response is either 1 or —1, the weight adjustment (2.34a) reduces to 

Aw i = ±2cx (2.35) 

where a plus sign is applicable when d { = 1 , and sgn (w'x) = - 1 , and a minus 
sign is applicable when d t = -1, and sgn (w'x) = 1. The reader should notice 
that the weight adjustment formula (2.35) cannot be used when d t = sgn (w'x). 
The weight adjustment is inherently zero when the desired and actual responses 
agree. As we will see throughout this text, the perceptron learning rule is of 
central importance for supervised learning of neural networks. The weights are 
initialized at any values in this method. 



Figure 2.23 Perceptron learning rule. 
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EXAMPLE 2.5 


This example illustrates the perceptron leaming rule of the network shown 
in Figure 2.23. The set of input training vectors is as follows: 



11 


' 0 ' 


'-1 ' 

*1 = 

O 

1 

. *2 = 

1.5 

-0.5 

• *3 = 

1 

0.5 


L — i J 


-1 


-1 


and the initial weight vector w 1 is assumed identical as in Example 2.4. The 
leaming constant is assumed to be c = 0.1. The teacher’s desired responses 
for Xj, x 2 , x 3 are c/, = — 1 , d 2 = — 1, and d 3 = 1, respectively. The leaming 
according to the perceptron leaming rule progresses as follows. 

Step 1 Input is Xj, desired output is : 


net 1 = w''X| = [ 1 -1 0 


1 



-1 


= 2.5 


Correction in this step is necessary since d 1 # sgn (2.5). We thus obtain 
updated weight vector 


w 2 = w 1 + 0. 1 ( — 1 - l)x, 
Plugging in numerical values we obtain 



' i ' 


' r 


' 0.8' 

w 2 = 

-i 

-0.2 

-2 


-0.6 

0 

0 


0 


0.5 


.-1. 


0.7. 


Step 2 Input is x 2 , desired output is d 2 . For the present weight vector w 2 
we compute the activation value net 2 as follows: 


net 2 


w 2 , x 2 = [0 1.5 - 0.5 -1 


0 . 8 ' 

- 0.6 

0 

0.7 


- 1.6 


Correction is not performed in this step since d 2 = sgn (—1.6) 

Step 3 Input is x 3 , desired output is d 3 , present weight vector is w 3 . 
Computing net 3 we obtain: 


0 . 8 ' 

- 0.6 

0 

0.7 


net 3 = w 3 , x 3 = [ — 1 1 0.5 - 1 


= - 2.1 
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Correction is necessary in this step since d 3 -=h sgn(— 2.1). The updated 
weight values are 

w 4 = w 3 + o m + i ) X3 

or 



' 0.8 ' 


-1 " 


' 0.6 ‘ 

4 

-0.6 

+ 0.2 

1 


-0.4 

w 4 = 

0.5 

= 

0.1 

0 



0.7 


-1 


0.5. 


This terminates the sequence of leaming steps unless the training set is 
recycled. It is not a coincidence that the fourth component of x 1? x 2 , and x 3 
in this example is invariable and equal to -1. Perceptron leaming requires 
fixing of one component of the input vector, although not necessarily at the 
— 1 level. 

The reader may ask what this training has achieved, if anything. At this 
point the answer can be very preliminary. The real objectives of training 
and its significance will require more thorough study later in this text. Our 
preliminary answer for now is that if the same training set is resubmitted, 
the respective output error should be, in general, smaller. Since the binary 
perceptron does not provide fine output error information, we may look at 
its input. The reader may verify it to see that net 4 — 0.9 obtained after three 
training steps as a response to resubmitted pattem x x is doser to net < 0 
than the initial activation value in the first training step, net 1 = 2.5. By 
observing successive net values for continued training we would see the 
trend of improving answer s. We show in Chapter 3 that, eventually, the 
network trained in this mode will stop committing any mistakes. m 


Delta Learning Rule 

The delta leaming rule is only valid for continuous activation functions as 
defined in (2.3a),| (2.4a), and in the supervised training mode. The leaming signal 
for this rule is called delta and is defined as follows 

r = [dj ~/(w'x)]/'(w'x) (2.36) 

The term/'(w[x) is the derivative of the activation function f{net) computed for 
net = w'x. The explanation of the delta leaming rule is shown in Figure 2.24. 
This leaming rule can be readily derived from the condition of least squared 
error between o { and d r Calculating the gradient vector with respect to w / of the 
squared error defined as 

A 1 ^ 

E = 2^ ~ 0i) 


(2.37a) 
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Figure 2.24 Delta learning rule. 


which is equivalent to 


E = \ -/(w'x )] 2 

(2.37b) 

we obtain the error gradient vector value 


V£ = -(dj - o,)/'(w'x)x 

(2.38a) 

The components of the gradient vector are 


= -(d t ~ Oi)f'( w'x)x-, for j = 1 , 2, . . . , n 

dwa 

(2.38b) 


Since the minimization of the error requires the weight changes to be in the 
negative gradient direction, we take 

Aw z = —rjVE (2.39) 

where 17 is a positive constant. We then obtain from Eqs. (2.38a) and (2.39) 

Aw; = 7 ] (d t — o^f'inet^x (2.40a) 

or, for the single weight the adjustment becomes 

A wy = 17 (df - ø,)/' (net, ).*,•, for j = 1, 2, . . . , n (2.40b) 

Note that the weight adjustment as in (2.40) is computed based on minimization 
of the squared error. Considering the use of the general learning rule (2.28) 
and plugging in the learning signal as defined in (2.36), the weight adjustment 
becomes 

Aw, = c(d-- ø,)/'(net,)x (2.41) 

Therefore, it can be seen that (2.41) is identical to (2.40), since c and 7] have 
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been assumed to be arbitrary constants. The weights are initialized at any values 
for this method of training. 

The delta rule was introduced only recently for neural network training 
(McClelland and Rumelhart 1986). This rule parallels the discrete perceptron 
tr ainin g rule. It also can be called the continuous perceptron training rule. The 
delta leaming rule can be generalized for multilayer networks. This will be 
discussed in more detail in Chapter 3 and extended in Chapter 4. 


EXAMPLE 2.6 


This example discusses the delta leaming rule as applied to the network 
shown in Figure 2.24. Training input vectors, desired responses, and initial 
weights are identical to those in Example 2.5. The delta leaming requires 
that the value f'(net) be computed in each step. For this purpose, we can 
use the following result derived in Equations (3.48) through (3.51): 

/'(net) = 1(1 - o 2 ) 

valid for the bipolar continuous activation function. The result is useful since 
it expresses the slope of the activation function through the neuron’s output 
signal. Using (2.41) for the arbitrarily chosen leaming constant c = 0.1, and 
A = 1 for the bipolar continuous acitivation function f(net) as in (2.3a), the 
delta rule training can be summarized as folio ws. 

Step 1 Input is vector x 1? initial weight vector is w 1 : 

net 1 = w u X! = 2.5 
o 1 = f (net 1 ) = 0.848 

/W) = i[l - (o 1 ) 2 ] = 0.140 
w 2 = c(d x — o l )f'(net l )x x + w 1 
- [0.974 -0.948 0 0.526]' 

Step 2 Input is vector x 2 , weight vector is w 2 : 
net 2 = w 2 'x 2 = —1.948 
o 2 = f(net 2 ) = —0.75 

/'(net 1 ) = 1[1 - (o 2 ) 2 ] = 0.218 

w 3 = c(d 2 - o 2 )f'(net 2 )x 2 + w 2 

= [0.974 -0.956 0.002 0.531]' 
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Step 3 Input is x 3 , weight vector is w 3 : 

net 3 = w 3 'x 3 = -2.46 
o 3 = f (net 3 ) = -0.842 

/'(net 3 ) = 1[1 - (o 3 ) 2 ] = 0.145 
w 4 = c(d 3 - o 3 )f\net 3 )x 3 + w 3 

= [0.947 -0.929 0.016 0.505]' 

Obviously, since the desired values are ±1 in this example, the cor- 
rections will be performed in each step, since d { — f {net t ) =£ 0 throughout 
the entire training. This method usually requires small c values, since it is 
based on moving the weight vector in the weight Space in the negative error 
gradient direction. mm 


Widrow-Hoff Learning Rule 

The Widrow-Hoff learning rule (Widrow 1962) is applicable for the super- 
vised training of neural networks. It is independent of the activation function of 
neurons used since it minimizes the squared error between the desired output 

value dj and the neuron ’s activation value net l = wjx. The learning signal for 

this rule is defined as follows 

r = d t - w'x (2.42) 

The weight vector increment under this learning rule is 

Aw; = c(dj - wjx)x (2.43a) 

or, for the single weight the adjustment is 

A Wjj = c(dj - WjX)Xj, for j = 1, 2, . . . , n (2.43b) 

This rule can be considered a special case of the delta learning rule. Indeed, 
assuming in (2.36) that /(w[x) = wjx, or the activation function is simply the 
identity function f(net) = net , we obtain f{net) — 1, and (2.36) becomes identi- 
cal to <2.42). This rule is sometimes called the LMS ( least mean square) learning 
rule. Weights are initialized at any values in this method. 


Correlation Learning Rule 

By substituting r = d t into the general learning rule (2.28) we obtain the 
correlation learning rule. The adjustments for the weight vector and the single 
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weights, respectively, are 

Aw,- = cd ( x 

A = cd ( x p for j = 1, 2, . . . , n 


(2.44a) 

(2.44b) 


This simple rule States that if d { is the desired response due to x- p the correspond- 
ing weight increase is proportional to their product. The rule typically applies 
to recording data in memory networks with binary response neurons. It can be 
interpreted as a special case of the Hebbian rule with a binary activation func- 
tion and for o { = d r However, Hebbian leaming is performed in an unsupervised 
environment, while correlation leaming is supervised. While keeping this basic 
difference in min# , we can observe that Hebbian rule weight adjustment (2.32a) 
and correlation rule weight adjustment (2.44a) become identical. Similar to Heb- 
bian leaming, this leaming rule also requires the weight initialization w = 0. 


Winner-Take-AII Learning Rule 

This leaming mle differs substantia].ly from any of the mies discussed so 
far in this section. It can orily be démonstrated and explained for an ensemble 
of neurons, preferably arranged in a layer of p units. This mle is an example of 
competitive leaming, and it is used for unsupervised network training. Typically, 
winner-take-all lemming is used for leaming statistical properties of inputs (Hecht- 
Nielsen 1987). The leaming is based on the premise that one of the neurons in 
the layer, say the m’th, has the maximum response due to input x, as shown 
in Figure 2.25. This neuron is declared the winner. As a result of this winning 
event, the weight vector w m 

= [vv ml w m2 w mn }‘ (2.45) 



Figure 2.25 Competitive unsupervised “winner-take-all” leaming rule. 
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Oy'u 1 ^ 

containing weights highlighted in the figure is the only one adjusted in the given 
unsupervised leaming step. Its increment is computed as follows 

Aw m = a(x - w m ) (2.46a) 

or, the individual weight adjustment becomes 

Aw mj = a{xj - w mj ), for j = 1, 2, . . . , n (2.46b) 

where a > 0 is a small leaming con|tant, typically decreasing as leaming pro- 
gresses. The winner selection is based on the following criterion of maximum 
activation among all p neurons participating* in a competition:^ .^ 

w‘ x = ; <max (w'x) (2.47) 

i = 1,2,...,/? 

As shown later in Chapter 7,. this criterion coiresponds to finding the weight 
vector that is closest to the ( input x. The rule (2.46) then reduces to incrementing 
w m by a fraction of x — w m . Note that only the winning neuron fan-in weight 
vector is adjusted. After the adjustment, its fan-in weights tend to better esti- 
mate the input' pattem in question. In this method, the winning neighborhood 
is sometimes extended beyond the single neuron winner so that it includes the 
neighboring neurons. Weights are typically initialized at random values and their 
lengths are normalized during leaming in this method. More detailed justification 
and application of this rule is pro vided in Chapter 7. 


Outstar Learning Rule 

Outstar leaming rule is another leaming rule that is best explained when 
neurons are arranged in a layer. This rule is designed to produce a desired re- 
sponse d of the layer of p neurons shown in Figure 2.26 (Grossberg 1974, 1982). 
The rule is used to provide leaming of repetitive and characteristic properties 
of input /output relationships. This rule is concemed with supervised leaming; 
however, it is supposed to allow the network to extract statistical properties of 
the input and output signals. The weight adjustments in this mle are computed 
as follows 

Awy = /3(d - w j) (2.48a) 

or, the individual weight adjustments are 

Aw m] = fi(d m - w mj ), for m = 1, 2, . . . , p (2.48b) 

Note that in contrast to any leaming mle discussed so far, the adjusted weights 
are fanning out of the / th node in this leaming method and the weight vector 
in (2.48a) is defined accordingly as 

A f 

W; = [Wy W 2j W pj 
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Arbitrarily selectable (3 is a small positive leaming constant decreasing during 
training. The rule typically ensures that the output pattem becomes similar to the 
undistorted desired output after repetitively applying (2.48) on distorted output 
versions. More detailed discussion and application of the outstar leaming mle is 
presented in Chapter 7. 


Summary of Learning Rules 

Table 2.1 provides the summary of leaming rules and of their properties. 
Seven leaming rules are tabulated and compared in terms of the single weight 
adjustment formulas, supervised versus unsupervised leaming mode, weight ini- 
tialization, and required neuron activation functions. Also, reference is made in 
the table regarding the conditions of leaming. Most leaming mies allow for 
leaming of a single, isolated neuron’s weight with exception of winner-take-all 
and outstar mies. Both of these mies require a layer of neurons in order for the 
weight adaptation to proceed. 

So far we have assumed that desired output values are precisely known 
for each input pattem when a network is trained in the supervised mode. But 
in some situations fewer details are available. When only partial information is 
available about the correctness of the network’s response, we have so-called re- 
inforcement leaming. In the extreme case, only “right” or “wrong” information is 
available to provide instmction to the network. The amount of feedback received 
from the environment is very limited in such cases since the teacher’s signal is 
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TABLE 2.1 

Summary of learning rules and their properties. 


Learning 

rule 

Single weight 
adjustment 

kWjj 

Initial 

weights 

Learning 

Neuron 

characteristics 

Neuron 
/ Layer 

Hebbian 

COjXj 

j = 1, 2, n 

0 

u 

Any 

Neuron 

Perceptron 

c [di - sgn (w'x)] xj 
j = 1, 2, n 

Any 

s 

Binary 
bipolar, or 
Binary 
unipolar 

Neuron 

Delta 

c(dj - Oi)f'(netj)Xj 
j = 1. 2, n 

Any 

s 

Continuous 

Neuron 

Widrow-Hoff 

c(d i — w'x); kj 
j = 1, 2, n 

Any 

s 

Any 

Neuron 

Correlation 

cdiXj 

j = 1, 2, n 

0 

s 

Any 

Neuron 

Winner-take-all 

Aw mj = a(xj - w mj ) 
m-winning neuron 
number 

j = 1, 2, ..., n 

Random 

Normalized 

u 

Continuous 

Layer of 
p neurons 

Outstar 

P(dj - wij) 
i = 1, 2, p 

0 

s 

Continuous 

Layer of 
p neurons 


c, a, p are positive learning constants 
S— supervised learning, U — unsupervised learning 
* —Awjj not shown 


rather evaluative than instructive. The reinforcement learning is sometimes called 
“learning with a critic” as opposed to learning with a teacher (Hertz, Krogh, and 
Palmer 1991). 

The coverage of the seven most important learning rules in this section is 
by no means exhaustive. Only those rules that are used later in the text have 
been covered. However, a serious reader may at this point raise a question about 
validity, convergence, or practical importance of the rules. Fortunately, each of 
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the leaming rules has its meaning and its own mathematical substantiation and 
applicability. Indeed, neural networks can be trained using these rules to perform 
useful tasks such as classification, recognition, or association. The outcome of 
the training is as usual for any training: Results are for the most part successful, 
but failures are also sometimes possible; however, they are the exception, rather 
than the rule. The exposition in later chapters provides more detailed answers to 
questions about why neural networks are capable of leaming. We will also study 
how the network leaming assists in solving problems that otherwise would have 
been difficult to solve. 


OVERVIEW OF 
NEURAL NETWORKS 

The review of introductory concepts of artificial neural systems in this 
chapter considers the main aspects of neural processing. Performance of neural 
networks during recall and their basic leaming techniques are studied. We also 
address some elementary information processing tasks that neural networks can 
solve. The review would not be complete, however, without an overview and 
taxonomy of artificial neural systems architectures. 

More than a dozen specific architectures are covered in this text. Let us 
attempt to group them into classes. There seem to be numerous ways of classi- 
fying artificial neural systems for the purposes of study, analysis, understanding, 
and utilization. One such way of characterizing neural networks makes use of 
the faet that the recall of information can be effeeted in two ways. 

The recall can be performed in the feedforward mode, or from input toward 
output, only. Such networks are called feedforward and have no memory. Recall 
in such networks is instantaneous, thus the past time conditions are irrelevant 
for their computation. Feedforward network ’s behavior does not depend on what 
happened in the past but rather what happens now. Specifically, the network 
responds only to its present input. 

The second group of networks performs recall computation with feedback 
operational. These networks can be considered as dynamical systems, and a cer- 
tain time interval is needed for their recall to be completed. Feedback networks 
are also called recurrent. They interact with their input through the output. As we 
have seen, recurrent networks can operate either in a discrete- or continuous-time 
mode. 

Another meaningful basis for classification is to differentiate neural net- 
works by their leaming mode. As discussed in Sections 2.4 and 2.5, supervised 
and unsupervised leaming are the main forms of leaming. In addition, there is 
a large group of networks that does not undergo supervised or unsupervised 
training, because the complete design information is available a priori. Many 
artificial neural memories fail into this category. They are designed by recording 
or encoding desired equilibria. We say that such networks are trained in a batch 
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mode. The continuous or discrete data that need to be associated, or hetero- 
associated, are used for batch mode leaming. The weights of such networks 
remain fixed following the recording. The reader should be aware that because 
there are few standards for neural network terminology, some authors consider 
memories trained by recording, or in batch mode, as networks trained with su- 
pervision. 

Table 2.2 summarizes the taxonomy of the most important artificial neural 
system architectures. Thirteen different architectures are listed in the first column 


TABLE 2.2 

Classification of the most important artificial neural 
networks according to their learning and recall modes. 


Network Architecture 

Learning 
Mode 
S, U, R 

Recall 
Mode 
FF, REC 

Recall 

Time Domain 
CT, DT 

Single-layer Network of Discrete and 
Continuous Perceptrons (Figure 1.1) 

S 

FF 

— 

Multilayer Network of Discrete and 
Continuous Perceptrons (EEG spike 
detectors, ALVINN, Figures 1.3 and 2.9) 

S 

FF 


Gradient-type Network (Figures 1.8 and 2.15) 

R 

REC 

CT 

Linear Associative Memory 

R 

FF 

— 

Autoassociative Memory [Figures 1.5, 1.7, 2.12, and 2.16(a)] 

R 

REC 

DT or CT 

Bidirectional Associative Memory [Figure 2.16(b)] 
(also Multidirectional Associative Memory) 

R 

REC 

DT or CT 

Temporal Associative Memory 

R 

REC 

DT or CT 

Hamming Network 

R 

FF 

— 

MAXNET 

R (fixed) 

REC 

DT or CT 

Clustering Network (Figure 2.25) 

U 

FF 

— 

Counterpropagation Network (Figure 2.25 and 2.26) 

U + S 

FF 

— 

Self-Organizing Neural Array (Figure 1.11) 

u 

FF 

— 

Adaptive Resonance Theory 1 Network 

u 

REC 

DT or CT 


Learning Mode Recall Mode Recall Time Domain (only for recurrent networks) 

S— Supervised FF— Feedforward CT— Continuous-time 

U— Unsupervised REC— Recurrent DT— Discrete-time 

R— Recording (Batch) 
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of the table. Only basic network configurations covered in this text are included, 
therefore the table is by no means exhaustive. To link the discussed architectures 
with the examples of networks introduced in Chapters 1 and 2, a number of 
references are provided in the first column. 

As stressed before, the leaming and recall modes are the most important 
characteristics of the networks we study. Supervised /unsupervised/recording 
(batch) leaming modes are specified in the table for each network of interest. 
Feedforward, or recurrent recall mode, is also highlighted. Note that feedforward 
networks provide instantaneous recall. Additionally, continuous- and discrete-time 
networks have been marked within the group of recurrent networks. 

No judgment or opinion about comparative worth of networks is made here 
or at any other place in this book. Instead, the focus is on the mathematical and 
algorithmic principles of each network. 

Neural networks can also be characterized in terms of their input and output 
values as discrete (binary) or continuous. As mentioned earlier in this chapter, 
diverse neuron models are used in various networks, sometimes even a combi- 
nation of them. These aspects of artificial neural networks will be highlighted in 
the sections where they are discussed. 


CONCLUDING REMARKS 

At this point it may be desirable to reiterate and expand the introductory 
remarks made so far conceming the differences between the conventional and 
neural network computation. Table 2.3 provides a summary overview of the basic 
aspects of computation for each of the approaches. Inspection of the entries of the 
table indicates that neural networks represent collective, non-algorithmic, low- 
precision, nonlinear computing machines that leam during training from examples 
and are data-controlled. As such they are very different from programmable con- 
ventional computers. The reader is encouraged to look at the details provided in 
Table 2.3. 

To gain better understanding of the presented concepts, let us look for 
analogies between the theory of leaming neurons and the theory of leaming 
of organisms. According to psychologists, there are two basic forms of leaming. 
Operant conditioning is a form of leaming that reinforces only the responses 
of an organism. The reinforcement is aimed at making the desired responses 
more likely and ignoring or punishing those that are not desirable. In contrast to 
this form of leaming, classical conditioning is based on training for a stimulus- 
response sequence. 

Perceptron or delta training mies are aimed at influencing responses so that 
they become more desirable when the same input is applied repetitively. These 
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TABLE 2.3 

Comparison of conventional and 
network computation. 

neural 


Task or 

performance aspect 

Conventional 

computation 

Neural network 
computation 

Problem solving 

Algorithm formulation 

Selection of architecture 
and definition of the 
set of representative 
examples 

Input data 

Numerical form 

Numerical, but also 
perceptual representation 
allowed 

Knowledge acquisition 

Programming 

Training 

Knowledge retrieval 

Sequential computation 

Recall in the form 
of collective 
processing 

Computation 

High-precision arithmetic 

Low-precision, 
nonlinear mapping 

Intemal data 

Intemal representation in 
control of the algorithm 

Intemal representation 
in control 
of input data 

Fixed- or 

intermediate-data 

storage 

ROM, RAM-high-precision 
binary memories 

Interconnecting 
weights of typically 
continuous values 


rules will thus correspond to operant conditioning. The Hebbian leaming rule 
couples the input and output and involves the relationship between the present 
stimulus-response pair, or input-output correlation. Thus, this leaming rule cor- 
responds to the classical conditioning organism leaming. 

This chapter treats introductory concepts and definitions used in artificial 
neural systems. It contains a discussion of relationships between real neural 
systems and their artificial counterparts. The chapter provides an overview of the 
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fundamental terminology. It also introduces a historical model of the artificial 
neuron developed by McCulloch and Pitts. Subsequently, a number of other 
artificial neuron models are discussed. 

The basic taxonomy of neural networks as feedforward and recurrent recall 
systems is provided along with recall analysis examples. The most important 
features of the leaming modes are formulated and discussed in this chapter. The 
phenomenon of network leaming is explained based on the unifying general 
leaming rule and on the so-called leaming signal concept. This approach clari- 
fies how neural networks leam. Leaming mies for adapting neuron’s weights as 
introduced in this chapter provide the reference framework needed for further 
study. In subsequent chapters our focus is on what the leaming can accomplish, 
how it happens, and how to use it in the most efficient way. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P2.1 The logic networks shown in Figure P2.1 use the McCulloch-Pitts model 
neuron from Figure 2.3. Find the truth tables and the logic functions that 
are implemented by networks (a), (b), (c), and (d). 

P2.2 Use McCulloch-Pitts neurons to design logic networks that implement 
the following functions. Use a single neuron in (a), and two neurons in 
Cascade for (b) and (c). (A prime denotes a logic complement.) 

(a) o k+l = x\x k x k 




Figure P2.1 Neural networks for analysis of logic functions (neurons from Figure 2.3). 
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Figure P2.3 Randomizing shift register using neurons from Figure 2.3. 


(b) o k+2 = x’fåxf 

(c) o k+1 — x\x k 

P2.3 The sequential circuit shown in Figure P2.3 contains a shift register, which 
processes the incoming impulse sequence and memorizes it using the reg- 
ister cell property of the McCulloch-Pitts neurons. Assume that input data 
are x° = 1, x 1 - 1, x 2 = 0, x 3 = 1, x 4 = x 5 = x 6 = . . . = 0. Compute 
the contents of the register (o\,o\,o\,o\) after eight processing steps have 
elapsed. Assume all cells have initially cleared outputs at k = 0. 

P2.4 The feedforward network shown in Figure P2.4 using bipolar binary neu- 
rons is mapping the entire plane x x , x 2 into a binary o value. Find the 
segment of the Jtj, x 2 plane for which o 4 = 1, and its complement for 
which o 4 = —1. 

P2.5 Each of the two networks shown in Figure P2.5 implements an identi- 
cal function on unipolar binary vectors in two-dimensional input space. 
Analyze the networks to show that they are equivalent. Assume 



Figure P2.4 Feedforward network for Problem P2.4. 
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-1 



(a) 



Figure P2.5 Feedforward network for analysis in Problem P2.5. 


P2.6 The network shown in Figure P2.6 is an analog-to-digital converter and 
can be used for coding a continuous value x into a 4-bit unipolar bi- 
nary code (o 3 o 2 o x o 0 ). Analyze the network and find each range 
of x that is converted in each of the binary codes (0 0 0 0 ), . . . , 
(1 1 1 1 ). Assume -1 < * < 16 and unipolar binary neurons used 

as in Problem P2.5. 

P2.7 The feedforward network shown in Figure P2.7 has been designed to code 
the grey intensity of a pixel expressed on a continuous scale, 0 ^ ^ 1 

(Ramacher 1989). The output binary code is (q 3 q 2 q x )* Analyze the 
network and find each range of x that is converted into the binary codes 
(0 0 0), . . . , ( 1 1 1). Assume unipolar binary neurons as in Prob- 

lem P2.5. 

P2.8 The network shown in Figure P2.8 uses neurons with a continuous ac- 
tivation function as in (2.3a) with A = 1. The neuron’s output has 
been measured as o x — 0.28 and o 2 = —0.73. Find the input vector 
x = [x x x 2 ] f that has been applied to the network. Also find the slope 
values of the activation function at the activations net x and net 2 . 
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x 


Figure P2.6 Feedforward converter for analysis in Problem P2.6. 

P2.9 The network shown in Figure P2.9 using neurons with f(net) as in (2.3a) 
has been designed to assign input vectors x l5 x 2 , x 3 to cluster 1 or 2. The 
cluster number is identical to the number of the neuron yielding the larger 
response. Determine the most likely cluster membership for each of the 
folio wing three vectors. Assume Å = 2. The input vectors are 

[0.866] [-0.985] = [ 0.342' 

Xl [ 0.5 J * * 2 [-0.174]’ * 3 [-0.94 

P2.10 The two networks shown in Figure P2.10 implement partitioning of the 
plane x l9 x 2 and yield o x = ± 1 (network a, bipolar binary output neuron), 
and \o x \ < 1 (network b, bipolar continuous output neuron). The neurons 
used in both networks are with bipolar characteristics as in (2.3). Assume 
for network b that A = 1. For both networks: 

(a) Find <9 t as a function of x x with x 2 = +2. 

(b) Draw the line on x {9 x 2 plane separating positive and negative re- 
sponses for \xi \ <5 and \x 2 \ < 5. 

P2.ll The network shown in Figure P2.ll uses neurons with continuous bipolar 
characteristics with Å = 1. It implements partitioning of plane x x , x 2 and 
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Unipolar binary f(ne,)=\ Sgn (net)+[ 

neurons 

Figure P2.7 Feedforward encoding of an analog value. 


maps it into \o x | < 1. Analyze the network and obtain regions on the 
plane xj 9 x 2 with positive and negative responses o x for kl < 5 - n < 5 » 
and T x ~ T 2 = 0. Then simulate the network and tabulate the function 
°\ (*\>xj) in the domain of interest. 

P2J2 The network shown in Figure P2.12 uses neurons with continuous bipolar 
characteristics with A = 5. It implements mapping of plane x x , x 2 into 
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Figure P2.8 NetWork for Problem P2.8. 



Figure P2.9 NetWork for cluster Identification. 


|o,| < 1. Simulate the network and tabulate the function o t (x u x 2 ) for 
I^jI < 2.5 and \x 2 \ < 2.5. 

P2.13 Assume that the vertices of a three-dimensional bipolar binary cube are 
used to represent eight States of a recurrent neural network with three 
bipolar binary neurons. The equilibrium States areo 1 = [ — 1 — 1 — 1 ] 

and o 2 = [ 1 1 1 ] *. Sketch the desirable State transitions between the 

vertices. 

P2.14 A discrete-time four-neuron recurrent network as in Figure 2.10 in the 
text with bipolar binary neurons has the weight matrix 

r 0 - 1-1 r 
-1 0 1-1 
-1 1 0-1 
1-1-1 0 


w = 
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(b) 


Figure P2.10 Networks for Problem P2.10 and implemented partitioning: (a) network a and 
(b) network b. 
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Figure P2.11 NetWork and implemented partitioning for Problem P2.11. (Partitioning shown for 
7, # 0 and 7 2 # 0.) 


Find the sequence of State transitions after the network has been initialized 
at x° = [ - 1 - 1 - 1 1 ] r . Repeat for three other initializing vectors 


r 

i 

o 

f 

-1 

0 

r 

-i 

-i 

, X° = 

1 

, x u = 

-i 

i 


1 


-i 


P2.15* For the continuous-time network using the bipolar neurons shown in Fig- 
ure P2.15: 

(a) Obtain discretized differential equations. 
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Figure P2.12 NetWork and implemented partitioning for Problem P2.12. 


(b) Perform sample simulations showing that o x = [l 1 1 ] ? and 

° 2 = [ — 1 + — 1 + — 1 + ] r are equilibrium points stored in the net- 

work shown in the figure. 

P2.16 Four steps of Hebbian leaming of a single-neuron network as in Figure 
2.21 have been implemented starting with w 1 = [l — l]* for leaming 
constant c = 1 using inputs as folio ws: 


f 


0' 


'2' 


r 

-2 

- *2 = 

1 

. X 3 = 

3 

, X 4 = 

-i 
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o. 


■>- o 2 




Figure P2.15 Continuous-time feedback network for Problem P2.15. (/?» = 1, C = 1, R = 10, 
and A = 2.5.) 


Find final weights for: 

(a) bipolar binary f(net ) 

(b) bipolar continuous f(net), A = 1. 

P2.17 Implement the perceptron rule training of the network from Figure 2.23 
using f{net) = sgn (net), c - 1, and the following data specifying the 
initial weights w 1 , and the two training pairs 




0 

-1 

-1 



Repeat the training sequence (x,,^), (x 2 ,d 2 ) until two correct responses 
in a row are achieved. List the net k values obtained during training. 
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P2.18 A single-neuron network using f {net) = sgn(net) as in Figure 2.23 has 
been trained using the pairs of x h d t as shown below: 



r 

-2 

\ 



0 ' 

-1 

2 

\ 

*1 = 

3 

» d\ — ~ 1 

9 

* 2 = 

, d 2 = 1 


-1 

J 



-1 

/ 



'-2' 

\ 

*3 = 

0 

-3 

, d 3 = -1 


-1 

/ 


The final weights obtained using the perceptron rule are 


w 4 = [3 2 6 1]' 

Knowing that correction has been perf ormed in each step for c = 1, 
determine the following weights: 

(a) w 3 , w 2 , w 1 by back-tracking the training 

(b) w 5 , w 6 , w 7 obtained for steps 4, 5, and 6 of training by reusing the 
sequence (x,,d,), (x 2 ,rf 2 ), (x 3 ,d 3 ). 

P2.19 Perform two training steps of the network as in Figure 2.24 using the 
delta leaming rule for A = 1 and c = 0.25. Train the network using the 
following data pairs 



The initial weights are w 1 = [101]'. [Hint: Us &f\net) = (1 /2)(1 — o 2 ) 
and /(ner) as in (2.3a).] 


P2.20 A recurrent network with three bipolar binary neurons has been trained 
using the correlation leaming rule with a single bipolar binary input vector 
in a single training step only. The training was implemented starting at 
w° = 0, for c = 1. The resulting weight matrix is 


W = 


1 -1 -1 

-1 1 1 

-1 1 1 


Find the vectors x and d that have been used for training. (There are two 
different answers.) 

P2.21 Perform two training steps for the network as in Figure 2.24 using the 
Widrow-Hoff leaming rule. Assume the same training data as in Problem 
P2.19. 



REFERENCES 


P2.22* Write the program for analysis of two-layer feedforward networks. The 
user-specified parameters should include type of the activation function 
(2.3a-b), (2.4a-b), A (if needed), the size of the network, and test input 
vectors. Verify your program using data from Problems P2.7 and P2.12, 
and from Example 2.1. Note that each neuron needs to have one fixed 
input (bias) with an associated weight. 

P2.23*Implement the leaming algorithms by writing a program that can be used 
to train weights of a single neuron with up to six inputs. The program 
should allow the user to specify the leaming rule, the type of activa- 
tion function, A (if needed), training data, and to perform the specified 
number of training steps. Rules to include are Hebbian, perceptron, delta, 
and Widrow-Hoff. Verify your program using data from Problems P2.16, 
P2.17, P2.19, and P2.21. 


REFERENCES 

Amari, S. I. 1990. “Mathematical Foundations of Neurocomputing,” IEEE Proc. 
78(9): 1443-1463. 

Arbib, M. A. 1987. Brains, Machines and Mathematics, 2nd ed. New York: 
Springer Verlag. 

Bemasconi, J. 1988. “Analysis and Comparison of Different Leaming Algo- 
rithms for Pattem Association Problems,” in Neural Information Processing 
Systems, ed. D. Anderson. New York: American Institute of Physics. 

Carpenter, G. A. 1989. “Neural Network Models for Pattem Recognition and 
Associative Memory,” Neural Networks 2: 243-257. 

Dayhoff, J. 1990. Neural Network Architectures — An Introduction. New York: 
Van Nostrand Reinhold. 

Durbin, R. 1989. “On the Correspondence Between Network Models and the 
Nervous System,” in The Computing Neuron, ed. R. Durbin, C. Miall, G. 
Mitchison, Reading, Mass.: Addison-Wesley Publishing Co. 

Feldman, J. A., M. A. Fanty, and N. Goddard. 1988. “Computing with Structured 
Neural Networks,” IEEE Computer (March): 91-103. 

Grossberg, S. 1977. Classical and Instrumental Leaming by Neural Networks 
in Progress in Theoretical Biology, vol. 3. New York: Academic Press, 
51-141. 

Grossberg, S. 1982. Studies of Mind and Brain: Neural Principles of Leaming 
Perception, Development, Cognition, and Motor Control. Boston: Reidell 
Press. 


90 


FUNDAMENTAL CONCEPTS AND MODELS OF ARTIFICIAL NEURAL SYSTEMS 


Hebb, D. 0.11949. The Organization ofBehavior, a Neuropsychological Theory. 
New York: John Wiley. 

Hecht-Nielseh, R. 1987. “Counterpropagation Networks,” Appl. Opt. 26(23): 
4979-4984. 

Hecht-Nielsen, R. 1990. Neurocomputing. Reading, Mass.: Addison-Wesley 
Publishing Co. 

Hertz, J., A. Krogh, and R. G. Palmer. 1991. Introduction to the Theory of 
Neural Computation. Redwood City, Calif.: Addison-Wesley Publishing Co. 

Hopfield, J. J., and D. W. Tank. 1986. “Computing with Neural Circuits: A 
Model,” Science 233: 625-633. 

Lemer, A. Ya. 1972. Fundamentals of Cybernetics. London: Chapman and Hall 
Ltd. 

Lippmann, R. R 1987. “An Introduction to Computing with Neural Nets,” IEEE 
Magazihe on Acoustics, Signal and Speech Processing (April): 4-22. 

McClelland, T. L., D. E. Rumelhart, and the PDP Research Group. 1986. 
Parallel Distributed Processing. Cambridge: The MIT Press. 

McCulloch, W. S., and W. H. Pitts. 1943. “A Logical Calculus of the Ideas 
Imminent in Nervous Activity,” Buli Math . Biophy. 5: 115-133. 

Mitchison, G. 1989. “Leaming Algorithms and Networks of Neurons,” in The 
Computing Neuron , ed. R. Durbin, C. Miall, G. Mitchison. Reading, Mass.: 
Addison-Wesley Publishing Co. 

Poggio, T., and F. Girosi. 1990. “Networks for Approximation and Leaming,” 
Proc. IEEE 78(9): 1481-1497. 

Ramacher, Uj, Wesseling, M. 1989. “A Geometrical Approach to Neural NetWork 
Design,’’ Proc. of the Joint Neural Networks Conf. , Washington, D.C., pp. 
II, 147-^52. 

Reilly, D. L., and L. N. Cooper. 1990. “An OverView of Neural Networks: Early 
Models to Real World Systems,” in Introduction of Neural and Electronic 
Networks. New York: Academic Press. 

Rosenblatt, F. 1958. “The Perceptron: A Probabilistic Model for Information 
Storage and Organization in the Brain,” Psych. Rev. 65: 386-408. 

Rumelhart, D. E. 1990. “Brain Style Computation: Leaming and Generalization,” 
in Introduction to Neural and Electronic Networks. New York: Academic 
Press. 

Szu, H. 199C. “Reconfigurable Neural Nets by Energy Convergence Leaming 
Priniciples Based on Extended McCulloch-Pitts Neurons and Synapses,” 
Proc. IEEE Int. Neural Networks Conf. pp. I-485-I-496. 

Tsypkin, Ya. Z. 1973. Foundations of the Theory of Leaming Systems. New 
York: Academic Press. 



REFERENCES 


91 


Wasserman, P. D. 1989. Neural Computing Theory and Practice. New York: Van 
Nostrand Reinhold. 

Widrow, B. 1962. “Generalization and Information Storage in Networks of 
Adaline ‘Neurons’,” in Self-Organizing Systems 1962 , ed. M. C. Jovitz, G. 
T. Jacobi, G. Goldstein. Washington, D.C.: Spartan Books, 435^61. 




SINGLE-LAYER 
PERCEPTRON CLASSIFIERS 

A cat that once sat on a hot 
stove will never again sit on a 
hot stove or on a cold one either. 

M. TWAIN 

3.1 Classification Model, Features, and Decision Regions 

3.2 Discriminant Functions 

3.3 Linear Machine and Minimum Distance Classification 

3.4 Nonparametric Training Concept 

3.5 Training and Classification Using the Discrete Perceptron: 

Algorithm and Example 

3.6 Single-Layer Continuous Perceptron Networks for Linearly 
Separable Classifications 

3.7 Multicategory Single-Layer Perceptron Networks 

3.8 Concluding Remarks 


I n this chapter, the foundations of trainable decision-making networks will be 
formulated. The principal function of a decision-making system is to yield 
decisions conceming the class membership of the input pattem with which it 
is confronted. Conceptually, the problem can be described as a transformation 
of sets, or functions, from the input space to the output Space, which is called 
the classification space. In general, the transformations of input data into class 
membership are highly complex and noninvertible. 

We will develop the expertise gradually. The linear discriminant functions 
will be. introduced first, and the simple correction rule to perform network train- 
ing will be devised. The training, or network adaptation, will be presented as 
a sequence of iterative weight adjustments. Strong emphasis will be put on the 
geometrical interpretation of the training procedure. This will provide the reader 
with necessary insight and, hopefully, with a better understanding of mathematical 
methods for neural network-based classifiers. 

Starting with the definitions of basic concepts of classification and with 
examples of two-class classifiers using the hard-limiting thresholding device, the 
training rules will then be extended to the case of continuous error function 
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minimization. This extension will require the replacement of the summing and 
hard-limiting thresholding device, or a discrete perceptron, with a continuous per- 
ceptron. The chapter will conclude with preliminary discussion of multi-neuron 
multicategory single-layer networks and their training. The architectures discussed 
in this chapter are limited to single-layer feedforward networks. The chapter also 
provides an explanation and justification of perceptron and delta training rules 
introduced formally but without proof in the previous chapter. 

The approach presented in this chapter is introductory and applicable mainly 
to the classification of linearly separable classes of pattems; thus it may be viewed 
as having somewhat limited practical importance. However, the same approach 
can easily be generalized and applied later for different network architectures. 
As the reader will see, such extension will prove to be much more powerful 
and useful. The discussion of these more versatile and complex architectures and 
relevant training approaches will be presented in subsequent chapters. 


3.1 ii 1 .i 

classiHcation model, 

FEATURES, AND 
DECISION REGIONS 


Our discussion of neural network classifiers and classification issues has 
so far been rather informal. A simplistic two-class classifier (Figure 1.1) was 
presented in ISection 1.1. Also, classification was introduced in Section 2.3 as 
a form of neural computation and, specifically, as a form of information recall. 
This notion Was illustrated in Figure 2.17(a). We now approach the classification 
issues in more detail. 

One of ihe most useful tasks that can be performed by networks of in- 
terconnected nonlinear elements introduced in the previous chapter is pattem 
classification. A pattern is the quantitative description of an object, event, or 
phenomenon. The' classification may involve spatial and temporal pattems. Ex- 
amples of spatial pattems are pictures, video images of ships, weather maps, 
fingerprints, and characters. Examples of temporal pattems include speech sig- 
nals, signals vs. time produced by sensors, electrocardiograms, and seismograms. 
Temporal pattems usually involve ordered sequences of data appearing in time. 

The goa] of pattem classification is to assign a physical object, event, or 
phenomenon to one of the prespecified classes (also called categories.) Despite 
the lack of any formal theory of pattem perception and classification, human 
beings and animals have performed these tasks since the beginning of their 
existence. Let us look at some of the classification examples. 

The oldest classification tasks required from a human being have been clas- 
sification of the human environment into such groups of objects as li ving species, 
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plants, weather conditions, minerals, tools, human faces, voices, or silhouettes, 
etc. The interpretation of data has been leamed gradually as a result of repetitive 
inspecting and classifying of examples. When a person perceives a pattem, an 
inductive inference is made and the perception is associated with some general 
concepts or clues derived from the person’s past experience. The problem of 
pattem classification may be regarded as one of discriminating the input data 
within object population via the search for invariant attributes among members 
of the population. 

While some of the tasks mentioned above can be leamed easily, the growing 
complexity of the human environment and technological progress has created 
classification problems that are diversified and also difficult. As a result, the 
use of various classifying aids became helpful and in some applications, even 
indispensable. Reading and processing bank checks exemplifies a classification 
problem that can be automated. It obviously can be performed by a human 
worker, however, machine classification can achieve much greater efficiency. 

Extensive study of the classification process has led to the development of 
an abstract mathematical model that provides the theoretical basis for classifier 
design. Eventually, machine classification came to maturity to help people in their 
classification tasks. The electrocardiogram waveform, biomedical photograph, or 
disease diagnosis problem can nowadays be håndled by machine classifiers. Other 
applications include fingerprint identification, patent searches, radar and signal 
detection, printed and written character classification, and speech recognition. 

Figure 3.1(a) shows the block diagram of the recognition and classification 
system. As mentioned in Section 2.3, recognition is understood here as a class 
assignment for input pattems that are not identical to the pattems used for training 
of the classifier. Since the training concept has not been fully explained yet, we 
will focus first on techniques for classifying pattems. 

The classifying system consists of an input transducer providing the input 
pattem data to the feature extractor. Typically, inputs to the feature extractor 
are sets of data vectors that belong to a certain category. Assume that each 
such set member consists of real numbers corresponding to measurement results 
for a given physical situation. Usually, the converted data at the output of the 
transducer can be compressed while still maintaining the same level of machine 
performance. The compressed data are called features . The feature extractor at the 
input of the classifier in Figure 3.1(a) performs the reduction of dimensionality. 
The feature space dimensionality is postulated to be much smaller than the dimen- 
sionality of the pattem space. The feature vectors retain the minimum number of 
data dimensions while maintaining the probability of correct classification, thus 
making handling data easier. 

An example of possible feature extraction is available in the analysis of 
speech vowel sounds. A 16-channel filterbank can provide a set of 16-component 
spectral vectors. The vowel spectral content can be transformed into perceptual 
quality space consisting of two dimensions only. They are related to tongue height 
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Raw 



Pattem 


Figure 3.1 Rscognition and classification system: (a) overall block diagram and (b) pattern 
classifier. 

and retractio i. Another example of dimensionality reduction is the projection of 
planar data on a single line, reducing the feature vector size to a single dimension. 
Although the projection of data will often produce a useless mixture, by moving 
and /or rotating the line it might be possible to find its orientation for which 
the projectec data are well separated. In such a case, two-dimensional data are 
represented by single-dimensional features denoting the position of the projected 
points on the line. 

It is beyond the scope of this chapter to discuss the selection of measure- 
ments or data feature extraction from the input pattem vector. We shall henceforth 
assume that the sets of extracted feature vectors yield the sets of pattem vectors 
to be classified and that the extraction, or selection, of input components to the 
classifier has been done as wisely as possible. Thus, the pattem vector x shown 
in Figure 3.1 (b) consists of components that may be features. Chapter 7 covers 
neural network architectures suitable for separate feature extractions. 

However, the /i-tuple vectors at the input to the classifier on Figure 3 . 1 (b) 
may also be input pattem data when separate feature extraction does not take 
place. In sucti a case the classifier ’s function is to perform not only the classifica- 
tion itself but also to intemally extract input pattem features. The rationale for this 
approach in our study is that neural networks can be successfully used for joint 
classification /recognition tasks and for feature extraction. For such networks, the 
feature extractor and classifier from Figure 3.1(a) can be considered merged to 
the single classifier network of Figure 3.1(b). Networks that operate on input 
data and peform no separate feature extraction when classifying pattems are 
discussed in Section 8.2. 
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Figure 3.2 Two simple ways of coding patterns into pattern vectors: (a) spatial object and 
(b) temporal object (waveform). 


In further discussion we will represent the classifier input components as a 
vector x. The classification at the system’s output is obtained by the classifier 
implementing the decision function i 0 (x). The discrete values of the response 
i Q are 1 or 2 or . . . or R. The responses represent the categories into which the 
pattems should be placed. The classification (decision) function of Equation (3.1) 
is provided by the transformation, or mapping, of the n-component vector x into 
one of the category numbers i 0 as shown in Figure 3.1(b): 

i O = *'o(x) (3.1) 

where 



Figure 3.2 depicts two simple ways to generate the pattem vector for cases 
of spatial and temporal objects to be classified. In the case shown in Figure 
3.2(a), each component x t of the vector x = [jiq x 2 x n ]* is assigned 

the value 1 if the fth cell contains a portion of a spatial object; otherwise, the 
value 0 (or -1) is assigned. In the case of a temporal object being a continuous 
function of time t , the pattem vector may be formed at discrete time instants t t 
by letting x t = /(/,), for i = 1, 2, . . . , n. This is shown in Figure 3.2(b). 

Classification can often be conveniently described in geometric terms. Any 
pattem can be represented by a point in n-dimensional Euclidean space E n called 
the pattern space. Points in that space corresponding to members of the pattem 
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Figure 3.3 Decision regions example. 

set are n- tuple vectors x. A pattem classifier maps sets of points in E n Space into 
one of the numbers i Q = 1, 2, as described by the decision function (3.1). 

The sets containing pattems of classes 1, 2, are denoted here by <% l9 dC 2 , 

. . . , %C R9 respectively. 

An example case for n = 2 and R = 4 is illustrated in Figure 3.3 showing 
disjoint regions d £ x , ..., % 4 . Let us postulate for simplicity that the classifier’s 
response as in (3.1) should be the class number. We now have the decision 
function for a pattem of class j yielding the following result: 

i 0 (x) = j for all x g j = 1, 2, 3, 4 

Thus, the example vector x = [20 10 ] r belongs to $C 2 and is of class 2, vector 

x = [ 4 6 belongs to <% 3 and is of class 3, etc. 

The regions denoted are called decision regions. Regions are separated 
from each other by so-called decision surfaces. We shall assume that pattems 
located on decision surfaces do not belong to any category. In Figure 3.3 an ex- 
ample of such a pattem located on the boundary is x = [ — 10 10 ] . Note that 

the decision surfaces in two-dimensional pattem space É 2 are curved lines. For 
a more general case of space E n they may be (n — l)-dimensional hypersurfaces. 
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3.2 

DISCRIMINANT FUNCTIONS 

In this chapter, the assumption is made that both a set of ^-dimensional 
pattems x 1? x 2 , . . . , x P and the desired classification for each pattem are known. 
The size P of the pattem set is finite, and it is usually much larger than the di- 
mensionality n of the pattem space. In many practical cases we will also assume 
that P is much larger than the number of categories R. Although the assumptions 
regarding n , P , and R are often valid for practical classification cases, they do 
not necessarily hold for our study of classification principles, nor do they limit 
the validity of our final conclusions. 

We will first discuss classifiers that use the discriminant functions concept. 
This discussion will lead to interesting conclusions as to how neural network 
classifiers should be trained. The study will also explicitly produce some of the 
training mies introduced in Chapter 2. 

Let us assume momentarily, and for the purpose of this presentation, that 
the classifier has already been designed so that it can correctly perform the 
classification tasks. During the classification step, the membership in a category 
needs to be determined by the classifier based on the comparison of R discrim- 
inant functions gi(x), g 2 ( x )> g/?( x )> computed for the input pattem under 

consideration. It is convenient to assume that the discriminant functions g { (x) are 
scalar values and that the pattem x belongs to the V th category if and only if 

fc(x) > gj(x), for ij = 1, 2, ± j (3.2) 

Thus, within the region <% i9 the ftli discriminant function will have the largest 
value. This maximum property of the discriminant function gfx) for the pattem 
of class i is fundamental, and it will be subsequently used to choose, or assume, 
specific forms of the gfx) functions. 

The discriminant functions gfx) and gj(x) for contiguous decision regions 
and define the decision surface between pattems of classes i and j in 
E n space. Since the decision surface itself obviously contains pattems x without 
membership in any category, it is characterized by gfx) equal to gj(x). Thus, the 
decision surface equation is 

gi(x) - gj(x) = 0 (3.3) 

Figure 3.4(a) displays six example pattems belonging to one of the two classes, 
with the simplest example of a decision surface in pattem space x x , x 2 being 
here a straight line. The case illustrated here is for n = R = 2 and exemplifies 
the concept of the linear discriminant function. Inspection of the figure indicates 
that there is an infinite number of discriminant functions yielding correct classi- 
fication. Let us look at the particulars of the depicted classification task in the 
following example. This example will also allow the reader to gain better insight 
into the more formal discussion of classification issues that follows. 
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EXAMPLE 3.1 


Six patt ems in two-dimensional pattem space shown in Figure 3.4(a) need 
to be classified according to their membership in sets as follows 

{[0 0] r , [ —0.5 — 1]', [ — 1 -2] r j : class 1 

{[2 0] f , [ 1.5 -1]',[1 -2]'} : class 2 

Inspectbn of the pattems indicates that the equation for the decision surface 
can be arbitrarily chosen as shown in the figure 

g(x) = -2x l + x 2 + 2 (3.4) 

Let us note that in the case discussed here we first arbitrarily select the 
Eq. (3.4) of the decision surface rather than determine the two discriminant 
functior s g,(x), for i = 1,2. Equation (3.4) represents the straight line divid- 
ing the oattem space that is plane x u x 2 into the contiguous decision regions 
d£ t , r Æ 2 . It is obvious that g(x) > 0 and g(x) < 0 in each of the half-planes 
containing pattems of class 1 and 2, respectively, and g(x) = 0 for all points 
on the ine. Therefore, the evaluation of the sign of g(x) = g t (x) - g 2 (x) 
can in this case replace the evaluation of the general maximum condition 
as in (3.2). Specifically, the functions g,(x) and g 2 (x) have not even been 
searched for in this case. Let us note that following explicitly the maximum 
conditic n expressed by (3.2) we would have to find and compare two specific 
discriminant functions. 

The following discussion analyzes classification using original condi- 
tion (3.2) with two suitably chosen discriminant functions: gj(x) and # 2 ( x )- 
Note that according to Equation (3.3), the projection of the intersection of 
two discriminant functions on the plane v, , x 2 is the decision surface given 
by (3.4). The example discriminant functions have been arbitrarily chosen 
as planes of #,(x) and g 2 (x) shown in Figure 3.4(b). The reader can see 
that they fulfill the correct classification requirements. Their contour maps 
are illustrated in Figure 3.4(c). The plane equations are: 

2*,-* 2 + 2 s ,(x)-4 = 0 (36a) 

-2x x + x 2 + 2g 2 (x) = 0 

Note how appropriate plane equations have been produced. We first observe 
that the decision line given by the equation — 2x x +x 2 +2 = 0 has two normal 
vectors. They are planar vectors [2 —1 ]* and [—2 1 ] . The discriminant 

function g x (x) can be built by appropriately selecting its 3-tuple unit normal 
vector ]\. This can be done by augmenting the vector [2 — 1 ] r by a third 

component of positive value, say equal to 2. The details of the procedure of 
budding this vector are shown in Figure 3.4(d). The resulting normal vector 
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is [ 2 - 1 2 ] it thus needs normalization. Similar augmentation can also 

be carried out for the other planar normal vector [—2 1 ] to yield r 2 . 

The unit normal vectors r x and r 2 , which pro vide appropriate plane 
orientations as shown in Figure 3.4(b) and correct discrimination by the 
classifier are thus equal (after normalization) 



(3.5b) 


Let us arbitrarily assume that the discriminant function planes, g t (x), inter- 
sects the point x x = 1, x 2 — 0, g x = 1. The discriminant function plane 
g 2 (x) has to intersect therefore the same point x x = 1, x 2 = 0, g 2 = 1. For 
each of the planes we have produced their normal vector and the coordinates 
of the intersecting point represented by a vector Q — [ 1 0 l]\ Using 

the normal vector-point equation from the Appendix we obtain for the first 
and second discriminant functions, respectively 



(3.5c) 


The reader can notice that equations (3.5c) are identical with the equations 
(3.5a). 

From (3.5a) we see that the explicit equations yielding the discriminant 
functions are: 


gl(x) = 


&(*)' = 





(3.6) 


The reader can easily verify that equating g { (x) to g 2 (x) given by Equation 
(3.6) leads to the decision surface equation (3.4). Let us point out that the 
decision surface does not uniquely specify the discriminant functions. Vec- 
tors r j and r 2 as chosen in (3.5b) are also not unique and there are an infinite 
number of vectors that could be used here. Also, the same arbitrary constant 
can be added to both gj(x) and g 2 (x) without changing the projection of the 
gi(x), g 2 (x) intersection on the plane x l9 x 2 . 
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To conclude the example, let us compare the computed values of dis- 
criminant functions g t (x) for selected pattems. The comparison should yield 
the complete class membership information. Indeed, pattem [2 0] f belongs 
to since g 2 (2 0) = 2 > g x (2 0) = 0, pattem [0 0]* belongs to 

3C 1 since g x ( 0 0) = 2 > g 2 (0 0) = 0, etc. In this example, the decision 
surface has been determined by inspection of pattems, and the example dis- 
criminant functions have been produced and discussed based on inspection 
of the geometrical conditions in the pattem space. h 


Assumi ig that the discriminant functions are known, the block diagram of 
a basic pattem classifier can now be adopted as in Figure 3.5(a). For a given 
pattem, the fth discriminator computes the value of the function g,(x) called 
briefly the discriminant . The maximum selector implements condition (3.2) and 
selects the largest of all inputs, thus yielding the response equal to the category 
number i Q . 

The discussion above and the associated example of classification has high- 
lighted a special case of the classifier into R classes for R — 2. Such a classifier 
is called the dichotomizer ; Although the ancient Greek civilization is rather fa- 
mous for other interests than decision-making machines, the word dichotomizer 
is of Greek origin. The two separate greek language roots are dicha and tomia 
and they mean in two and cut , respectively. It has been noted that the general 
classification condition (3.2) for the case of a dichotomizer can now be reduced 
to the inspection of the sign of the following discriminant function 

g(x) = g j (x) - g 2 (x) (3.7a) 

Thus, the general classification rule (3.2) can be rewritten for a dichotomizer as 
follows 

g« > 0 : class 1 
g(x) < 0 : class 2 

The evaluation of conditions in (3.7b) is easier to implement in practice than 
the selectior of maximum. Subtraction and sign examination has replaced the 
maximum value evaluation. A single threshold logic unit (TLU) can be used to 
build such a simple dichotomizer as shown in Figure 3.5(b). As discussed in the 
previous clupter, the TLU can be considered as a binary (discrete) version of a 
neuron. The TLU with weights has been introduced in Chapter 2 as the discrete 
binary perceptron. The responses 1 , — 1 , of the TLU should be interpreted as in- 
dicative of categories 1 and 2, respectively. The TLU element simply implements 
the sign function defined as 


i a = sgn [g(x)] = 


— 1 for g(x) < 0 

undefined for g(x) = 0 
1 for g(x) > 0 
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Figure 3.5 Computing the classification: (a) into R categories, (b) dichotomizer (R = 2), and 
(c) decision surface for n = 2 and R = 2. 


Another example of a decision surface for d = R = 2 is shown in Figure 
3.5(c). Although we can see that the TLU element would probably provide de- 
cision about the class membership of the depicted set of pattems, the design of 
the discriminator for this dichotomizer does not seem as straightforward as in 
Example 3.1. Let us only notice that the discriminant functions may result as 
nonlinear functions of x { , x 2 , and let us postpone more detailed discussion of 
such a case until later in this chapter. 

Once a general functional form of the discriminant functions has been suit- 
ably chosen, discriminants can be computed using a priori information about 
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the classificat on of pattems, provided that such information is available. In such 
an approach, the design of a classifier can be based entirely on the computa- 
tion of decision boundaries as derived from pattems and their membership in 
classes. Throughout this chapter and most portions of this book, however, we 
will focus manly on classifiers whose decision capabilities are generated from 
training pattems by means of an iterative leaming, or training, algorithm. Once 
a type of discriminant function has been assumed, the algorithm of leaming 
should result in a solution for the initially unknown coefficients of discriminant 
functions, prc vided the training pattem sets are separable by the assumed type 
of decision function. For study of such adaptive, or trainable, classifiers, the 
following assumptions are made: 

1. The training pattem set and classification of all its members are known, 
thus the training is supervised. 

2. The discriminant functions have a linear form and only their coefficients 
are adjusted in the training procedure. 

Under these assumptions, a trainable classifier can be implemented that leams by 
examples. In this context, we will be interested in input data vectors for which 
we have a priori knowledge of their correct classification. These vectors will 
be referred to as class prototypes or exemplars. The classification problem will 
then be one of finding decision surfaces, in n-dimensional space, that will enable 
correct classification of the prototypes and will afford some degree of confidence 
in correctly recognizing and classifying unknown pattems that have not been used 
for training. The only limitation regarding the unknown pattems to be recognized 
is that they are drawn from the same underlying distributions that have been used 
for classifier’s training. 


3.3 

LINEAR MACHINE AND 
MINIMUM DISTANCE 
CLASSIFICATION 

The efficient classifier having the block diagram as shown in Figure 3.5(a) 
must be desciibed, in general, by discriminant functions that are not linear func- 
tions of the inputs x t , x 2 , . . . , x n . An example of such classification is provided 
in Figure 3.5tc). As will be shown later, the use of nonlinear discriminant func- 
tions can be avoided by changing the classifier ’s feedforward architecture to 
the multilayer form. Such an architecture is comprised of more layers of el- 
ementary classifiers such as the discussed dichotomizer or the “dichotomizer” 
providing cortinuous response between —1 and +1. The elementary decision- 
making discrete, or continuous dichotomizers, will then again be described with 
the argument g(x) being the basic linear discriminant function. 


Since the linear discriminant function is of special importance, it will be dis- 
cussed below in detail. It will be assumed throughout that E n is the //-dimensional 
Euclidean pattem space. Also, without any loss of generality, we will initially 
assume that R — 2. In the linear classification case, the decision surface is a hy- 
perplane and its equation can be derived based on discussion and generalization 
of Figure 3.6. 

Figure 3.6 depicts two clusters of pattems, each cluster belonging to one 
known category. The center points of the clusters shown of classes 1 and 2 are 
vectors x { and x 2 , respectively. The center, or prototype, points can be interpreted 
here as centers of gravity for each cluster. We prefer that the decision hyperplane 
contain the midpoint of the line segment connecting prototype points P { and P 2 , 
and it should be normal to the vector x x — x 2 , which is directed toward P x . 
The decision hyperplane equation can thus be written in the following form 
(see Appendix): 

(X! - x 2 )'x + i (||x 2 || 2 - HxJI 2 ) - 0 (3.9) 

The left side of Equation (3.9) is obviously the dichotomizer’s discriminant func- 
tion g(x). It can also be seen that g(x) implied here constitutes a hyperplane 
described by the equation 


or, briefly. 


W\X\ + W 2 X 2 + ■■■ + w„x„ + w n+l = 0, or 
w'x + w n+l = 0 


(3.10a) 

(3.10b) 


w 

Wn + 1 



t 


(3.10c) 
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where w denotes the weight vector defined as follows: 


The weighting coefficients w u w 2 , . .., w n+l of the dichotomizer can now be 
obtained easily from comparing (3.9) and (3.10) as follows: 


*1 “ X 2 

= \ (ll X 2l| 2 - ll x ll| 2 ) 


It can be seer 
plicitly knowi 
unless the clu: 
a priori using 
The lines 
between more 
there will be 
(3.11) for R : 
a larger numt 
ous, thus elir 

&(*) = i'/ x ) 
be considered 
classifiers wil 
Let us as 
pattems into 
prototype poii 
Euclidean disi 
expressed by 


A minimum-«: 
classification 
smallest dista 
squared distai 


Obviously, th 
distances und 
only R terms, 
X; this term t< 
the largest of 


i from Equation (3.11) that the discriminant function becomes ex- 
1 if prototype points P x and P 2 are known. We can also note that 
ster center coordinates x l9 x 2 are known, g(x) cannot be determined 
the method just presented. 

r form of discriminant functions can also be used for classifications 
than two categories. In the case of R pairwise separable classes, 
up to R(R — 1)12 decision hyperplanes like the one computed in 
= 2. For R = 3, there are up to three decision hyperplanes. For 
)er of classes, some decision regions 9£ i9 may not be contigu- 
linating some decision hyperplanes. In such cases, the equation 
las no solution. Still, the dichotomizer example just discussed can 
as a simple case of a multiclass minimum-distance classifier. Such 
i be discussed in more detail later in this chapter. 
sume that a minimum-distance classification is required to classify 
3ne of the R categories. Each of the R classes is represented by 
lts P l9 P 2 , ... 9 P R being vectors x l9 x 2 , . . . , x R , respectively. The 
ance between input pattem x and the prototype pattem vector x t is 
the norm of the vector x — x, as follows: 

ll x - X/ll = a/( x - x ;)'( x - x i) (3.12) 

istance classifier computes the distance from pattem x of unknown 
to each prototype. Then, the category number of that closest, or 
ice, prototype is assigned to the unknown pattem. Calculating the 
ces from Equation (3.12) yields 

||x - Xjj | 2 = x'x - 2x'x + x'x /9 for i = 1, 2, . . . , R (3.13) 

s term x r x is independent of i and shows up in each of the R 
^r evaluation in Equation (3.13). Thus, it will suffice to compute 
2x^x-x^ Xp for i = 1, in (3.13), and to determine for which 

tkes the largest of all R values. It can also be seen that choosing 
the terms xjx - 0.5xjx { is equivalent to choosing the smallest of 
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the distances ||x - x,||. This property can now be used to equate the highlighted 
term with a discriminant function g,(x): 

g,-(x) = x'x - ^x'x,, for i = 1, 2, . . . , R (3.14) 

It now becomes clear that the discriminant function (3. 14) is of the general linear 
form, which can be expressed as: 

8i00 = w'x + w in+l , for i = 1, 2, . . . , R (3.15) 

The discriminant function coefficients that are weights w z - can be determined by 
comparing (3.14) and (3.15) as follows: 


W/ 


= X; 


w, 


i,n+\ 


"2 


for i = 1, 2, . . . , R 


(3.16) 


At this point, note that minimum-distance classifiers can be considered as linear 
classifiers, sometimes called linear machines. Since minimum-distance classifiers 
assign category membership based on the closest match between each prototype 
and the current input pattem, the approach is also called correlation classification. 
The block diagram of a linear machine employing linear discriminant functions as 
in Equation (3.15) is shown in Figure 3.7. It can be viewed as a special case of the 
more general classifier depicted in Figure 3.5. The machine consists of R scalar 
product computing nodes and of a single maximum selector. During classification, 
after simultaneously computing all of the R discriminants g,(x) for a submitted 
pattem, the output stage of the classifier selects the maximum discriminant and 
responds with the number of the discriminant having the largest value. 

Let us finally notice that the decision surface Sy for the contiguous decision 
regions %C h cJCj is a hyperplane given by the equation 

S/(x) - gj(x) = 0, or (3.17a) 

w'x + w,.„ +1 - w'x - w jn+l = 0 (3.17b) 

It is a widely accepted convention to append formally a 1 as the n + 1 ’th com- 
ponent of each pattem vector. The augmented pattem vector is now denoted by 
y, it consists of n + 1 rows, and is defined as follows: 



Using the notation of the augmented pattem vector allows for rewriting expres- 
sion (3.15) for the linear discriminant function to the more compact form of 

£;( y) = w'y (3.19) 

Note, however, that whenever the augmented pattem vector is used, the associated 
weight vector w contains n 4- 1 components. The augmenting weight component 
is w in + l , for i — 1,2,..., R. For the sake of notational simplicity, the notation 
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for both weight vectors and augmented weight vectors are the same throughout 
the text. Whether or not a pattem or weight vector has been augmented can 
usually be determined from the context. 


EXAMPLE 3.2 


In this example a linear (minimum-distance) classifier is designed. Decision 
lines are generated using a priori knowledge about the center of gravity of 
the prototype points. The assumed prototype points are as shown in Figure 
3.8(a) ar d their coordinates are 



'10' 


2 


'-5' 

X 1 = 

2 

. X 2 = 

-5 

x 3 = 

5 
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' 10' 


' 2 ' 


' -5' 

Wi = 

2 

-52. 

. w 2 = 

-5 

-14.5 

, w 3 = 

5 

.-25. 


It is also assumed that each prototype point index corresponds to its class 
number. Using formula (3.16) for R = 3, the weight vectors can be obtained 
as 


(3.20a) 


Note that the weight vectors in (3.20a) are augmented. The corresponding 
linear discriminant functions are 

g x (x) = I0x x + 2jc 2 “ 52 

g 2 (x) = 2x y — 5*2 — 14.5 (3.20b) 

g 3 (x) = -5x { + 5*2 — 25 

The three discriminant functions are shown in Figure 3.8(b), and their con- 
tour map in the pattem space x u x 2 is illustrated in Figure 3.8(c). Based on 
the computed discriminant functions, a minimum-distance classifier using the 
maximum selector can be completed for R — 3. The resulting classifier is 
shown in Figure 3.8(d). Inspecting Figure 3.8(a) reveals that there are three 
decision lines S 12 , S 13 , and S 23 separating the contiguous decision regions 
d£ x , dC 2 ; §£ 3 ; and dC 2 , 3£ 3 , respectively. These lines are given by the pro- 
jections on the pattem plane x x ,x 2 of the intersections of the discriminant 
functions of Figures 3.8(b) and (c). The decision lines can be calculated by 
using the condition (3.17) and the discriminant functions (3.20b) as 


^12 

Sl3 

5*23 


8*! + 7*2 “ 37.5 = 0 

- 15*j + 3*2 + 27 = 0 

— 7 *! + 10*2 ~ 10*5 = 0 


(3.20c) 


Substituting the weight vectors (3.20a) directly into the decision function 
formula (3.17b) also results in the decision lines (3.20c). In this example 
we have derived the weight vectors and equations for decision surfaces for 
a three-class classifier. We have also provided insight into the geometrical 
relationships in the pattem space. mi 

The notion of linearly separable pattems is now introduced. Assume that 
there is a pattem set S£. This set is divided into subsets $C 1 , 9£ 2 , . .., re- 
spectively. If a linear machine can classify the pattems from as belonging to 
class /, for i — 1, 2, R, then the pattem sets are linearly separable . Using 
this property of the linear discriminant functions, the linear separability can be 
formulated more formally. If R linear functions of x as in (3.10) exist such that 

gi(x) > gj(x) for all x e i = 1, 2, j = 1,2,...,/?, i±j 

then the pattem sets are linearly separable. 

Figure 3.9 shows the example sets of linearly nonseparable pattems in two- 
and three-dimensional pattem space. It depicts the parity function f(x x , * 2 , ...,*„) 



O Class 2 


(a) (b) 

Figure 3.9 Example of parity function resulting in linearly nonseparable patterns (R = 2): 

(a) x ! © x 2 and (b) x 1 © x 2 © x 3 . 

for two and three variables (see Appendix). Figure 3.9(a) exemplifies the bipolar 
and Figure 3.9 (b) the unipolar version of the parity function spanned at the cor- 
ners of an ^-dimensional hypercube. The parity function as defined in Equation 
(3.21) is often called the XOR function: 

XOR(x l9 x 2 ^--,x n ) = x x ® jc 2 0 (3.21) 

where the syn bol © denotes the Exclusive OR Boolean function operator. This 
function is often convenient to exemplify classification of pattems that are linearly 
nonseparable. 

Geometrically, it can easily be seen that if hyperplanes exist that divide the 
sets 3E l9 $£ 2 , • • • , then the pattems are linearly separable. In contrast to Fig- 
ures 3.9, 3.3, and 3.5(c) showing examples of linearly nonseparable pattems. 
Figures 3.4, 3.6, 3.8(a) show classification of pattems that are linearly separable. 
It can be noticed that decision surfaces for linearly separable pattems define 
convex decision regions in the pattem space. 


3.4 | 

NONPARAMETRIC 
TRAININ 3 CONCEPT 

Thus far our approach to the design of pattem classifiers has been analytical 
and based on computations of decision boundaries derived from inspection of 
sample patterr s, prototypes, or their clusters. In theoretical considerations and 
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Figure 3.10 Linear dichotomizer using hard-limiting threshold element, or the TLU-based per- 
ceptron. 


Examples 3.1 and 3.2, we have shown that coefficients of linear discriminant 
functions called weights can be determined based on a priori information about 
sets of pattems and their class membership. 

In this section we will begin to examine neural network classifiers that 
derive their weights during the leaming cycle. The sample pattem vectors x u 
x 2 , ..., x P , called the training sequence, are presented to the machine along 
with the correct response. The response is provided by the teacher and specifies 
the classification information for each input vector. The classifier modifies its 
parameters by means of iterative, supervised leaming. The network leams from 
experience by comparing the targeted correct response with the actual response. 
The classifier structure is usually adjusted after each incorrect response based on 
the error value generated. 

Let us now look again at the dichotomizer introduced and defined in Section 
3.2. We will develop a supervised training procedure for this two-class linear 
classifier. The expanded diagram of the dichotomizer introduced originally in 
Figure 3.5(b) is now redrawn in Figure 3.10. The dichotomizer shown consists 
of n + 1 weights and the TLU performing as a binary decision element. It is 
identical to the binary bipolar perceptron from Figure 2.7(a). The TLU itself can 
be considered a binary response neuron. The input to the binary response neuron 
is the weighted sum of components of the augmented input vector y. 

In the next part of this section, we discuss the adaptive linear binary classifier 
and derive the perceptron training algorithm based on the originally nonadaptive 
dichotomizer. Assuming that the desired response is provided, the error signal is 
computed. The error information can be used to adapt the weights of the discrete 
perceptron from Figure 3.10. First we examine the geometrical conditions in 
the augmented weight space. This will make it possible to devise a meaningful 
training procedure for the dichotomizer under consideration. 

From previous considerations we know that the decision surface equation 
in /i-dimensional pattem space is 

w'x + w n+l = 0 


(3.22a) 
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Figure 3.11 Decision hyperplane in augmented weight space for a five pattern set from two 
classes. 


and Equation [3.22a) can be rewritten in the augmented weight space E n+l as 

w'y = 0 (3.22b) 

This is a normal vector-point equation and is the first equation in this text con- 
sidered in the weight space as compared to the earlier equations evaluated in the 
pattem space. Equation (3.22b) describes a decision hyperplane in augmented 
weight space. In contrast to the decision hyperplane of Equation (3.22a) in n- 
dimensional space, this hyperplane always intersects the origin, point w = 0. Its 
normal vector, which is perpendicular to the plane, is the pattem y. To visualize 
this, five example decision hyperplanes in the augmented weight space for five 
prototype pattems of two classes are shown in Figure 3.11. 
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In further discussion it will be understood that the normal vector will always 
point toward the side of the space for which w'y > 0, called the positive side, 
or semispace, of the hyperplane. Accordingly, the vector y x points toward the 
positive side of the decision hyperplane w*y } = 0 in Figure 3.11. The pattem 
vector orientation is thus fixed toward class 1 decision half-plane. The same 
remarks also apply to the remaining four pattems y 2 , . . . , y 5 , for which decision 
hyperplanes and normal vectors have also been shown in the figure. By labeling 
each decision boundary in the augmented weight space with an arrow pointing 
into the positive half-plane, we can easily find a region in the weight space 
that satisfies the linearly separable classification. (This notation using arrows or 
pattem vectors for pointing positive half-space is used throughout the text.) To 
find the solution for weights, we will look for the intersection of the positive 
decision regions due to the prototypes of class 1 and of the negative decision 
regions due to the prototypes of class 2. 

Inspection of the figure reveals that the intersection of the sets of weights 
yielding all five correct classifications of depicted pattems is in the shaded region 
of the second quadrant as shown in Figure 3.11. Let us now attempt to arrive 
iteratively at the weight vector w located in the shaded weight solution area. To 
accomplish this, the weights need to be adjusted from the initial value located 
anywhere in the weight space. This assumption is due to our ignorance of the 
weight solution region as well as weight initialization. The adjustment discussed, 
or network training, is based on an error-correction scheme. 

As shown below, analysis of geometrical conditions can provide useful 
guidelines for developing the weight vector adjustment procedure. However, if the 
dimensionality of the augmented pattem vector is higher than three, our powers of 
visualization are no longer of assistance in determining the decision or adjustment 
conditions. Under these circumstances, the only reasonable recourse would be to 
follow the analytical approach. Fortunately, such an analytical approach can be 
devised based on geometrical conditions and visualization of decision boundaries 
for two or three dimensions only of pattem vectors. 

Figure 3.12(a) shows a decision surface for the training pattem y x in the 
augmented weight space of the discrete perceptron from Figure 3.10. Assume 
that the initial weights are w 1 (case A) and that pattem y x of class 1 is now 
input. In such a case, weights w 1 are located in the negative half-plane. Obvi- 
ously, the pattem is misclassified since we have for the present weights w 1 and 
pattem y 1 a negative decision due to w 1 ^ < 0. To increase the discriminant 
function w lf y l9 we need to adjust the weight vector, preferably in the direction of 
steepest increase, which is that of a gradient. The gradient V H; can be expressed 
as follows: 

VnXw'y,) = y, (3.23) 

Thus, when the pattem of class 1 is misclassified, the adjusted weights should 
preferably become 


w' = w 1 4- cy x 


(3.24) 
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(a) 

Figure 3.12a Weight adjustments of learning dichotomizer: (a) steepest descent. 

where constar t c > 0 is called the correction increment , and a prime is used 
to denote weights after correction. It can be seen that c Controls the size of the 
adjustment. Repeating the weight adjustment a number of times would bring the 
weight into the shaded solution region, independent of the chosen correction 
increment. 

Case B ir Figure 3.12(a) illustrates a similar misclassification with the initial 
weights now at w 1 and y x of class 2 being input. Since w u yj > 0 and misclas- 
sification occurs also in this case, the discriminant function needs to be changed 
through weight adjustment, preferably in the steepest decrease, or negative gra- 
dient, directio i. We thus obtain from Equation (3.23) the following expression 
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► 


(b) 

Figure 3.12b Weight adjustments of learning dichotomizer (continued): (b) example. 


for the adjusted weights: 

W ' = w 1 — cy x (3.25) 

An illustration of a series of weight adjustments in the augmented weight Space is 
shown in Figure 3.12(b). It depicts the weight corrections for an ordered training 
set of inputs consisting of three augmented pattem vectors y x , y 2 , and y 3 . The 
pattems are numbered here in the sequence as they appear at the input of the clas- 
sifier. As shown in previous discussion, weights in the shaded region will provide 
the solution for the augmented weight vector. Three lines labeled ®, ®, and ® are 
obviously fixed for variable weight values that result during the training process 
shown. This is because their normal vectors, pattems y x , y 2 , and y 3 , are fixed. 

Consider training starting at the initial weight of w 1 . During the first step of 
classification, the first prototype pattem of class 1 is presented and misclassified. 
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The new augmented weights after correction become 

w 2 — w 1 -f cy x 

Similarly, when the second pattem is presented for the current weight vector w 2 , 
it is misclassified, and the corrected weight vector w 2 takes the value 

w 3 = w 2 + cy 2 

Since the thiri pattem belongs to class 2, the misclassification happens again in 
this step and the new weight vector results as 

4 3 

w = w - cy 3 

Notice that the discussed training procedure with input y is equivalent to 

■ increasing the discriminant function g( y) as in (3.19) by c||y|| 2 if the 
pattern of class 1 is undetected, and 

■ decreasing the discriminant function by c||y|| 2 if the pattem of class 2 
is uncletected. 

Both the increment and decrement of g( y) take place in such a way that the weight 
vector is displaced in the direction normal to the decision hyperplane. Since this 
change is in tlie direction of the shortest path to the correct classification weight 
area, it is the most appropriate direction. 

The supervised training procedure can now be summarized using the fol- 
lowing expression for the augmented weight vector w' after adjustment 

w' = w ± cy (3.26) 

where the pos tive sign in (3.26) applies for undetected pattem of class 1, and the 
negative sign applies for undetected pattem of class 2. If a correct classification 
takes place ur der this rule, no adjustment of weights is made. 

The reader can notice that weight correction formula (3.26) is the same as 
the perceptror leaming rule (2.34-5). Accurate correspondence of both training 
methods requires that the leaming constant in (2.34—5) is a half of the correction 
increment c in (3.26), and that x used in (2.34—5) is the augmented pattem vector. 
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TRAINING AND 

CLASSIF ICATION USING THE 

DISCRETE PERCEPTRON: 

ALGORITHM AND EXAMPLE 

Let us lcok now in more detail at the weight adjustment aspects. Again 
using the geometrical relationship, w.e choose the correction increment c so that 
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the weight adjustment step size is meaningfully controlled. The distance p of 
a point w 1 from the plane w'y = 0 in (n + l)-dimensional Euclidean space is 
computed according to the formula (see Appendix): 


P 


w lr y 

llyll 


(3.27a) 


where the sign in front of the fraction is chosen to be the opposite of the sign of 
the value of w„ +1 . A simpler rule is that the sign must be chosen to be identical 
to the sign of w lr y. Since p is always a nonnegative scalar by definition of the 
distance, expression (3.27a) can be rewritten using the absolute value notation 
as follows: 


_ |w 1? y| 

P llyll 


(3.27b) 


Let us now require that the correction increment constant c be selected such 
that the corrected weight vector w 2 based on (3.26) dislocates on the decision 
hyperplane w u y = 0, which is the decision hyperplane used for this particular 
correction step. This implies that 


w 2 'y = 0, or 
(w 1 ± cy)'y = 0 


(3.28) 


and the required correction increment results for this training step as 


c = =F 


w u y 

y'y 


(3.29a) 


Since the correction increment c is positive, (3.29a) can be briefly rewritten as 

(3.29b) 


c = 


w u y 

y'y 


The length of the weight adjustment vector cy can now be expressed as 


INI “ (3.30) 

Noting that y'y = ||y|| 2 leads to the conclusion that, as required, the distance 
p from the point w 1 to the decision plane and expressed by (3.27) is identical 
to the length of the weight increment vector (3.30). For this case the correction 
increment c is therefore not constant and depends on the current training pattem 
as expressed by (3.29). 

The basic correction rule (3.26) for c = 1 leads to a very simple adjustment 
of the weight vector. Such adjustment alters the weight vector exactly by the 
pattem vector y. Using the value of the correction increment calculated in (3.29) 
as a reference, several different adjustment techniques can be devised depending 
on the length of the weight correction vector w 2 — w 1 . This length is proportional 
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Locus of ' 
adjusted 4 
• f weights y 


^=1 
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\ 

\X=0 


Pattern i 
of class 1 


Figure 3.13 II ustration of correction jncrement value, fth step. 


to a coefficient A > 0, where 


c = A J — (3.31) 
y'y 

It folio ws fre m the preceding discussion of geometrical terms that A is the ratio 
of the distance between the old weight vector w 1 and the new weight vector w 2 , 
to the distance from w 1 to the pattem hyperplane in the weight space. 

The different choices for correction inerement values c are depicted in Fig- 
ure 3.13. No weight adjustment in the fth step takes place for A = 0. For 
A = 1 the corrected weights displace exaetly onto the decision plane w'y* = 0 
as stated before in case (3.30). For A = 2, the corrected weights are reflected 
symmetrically with respect to the decision plane. In the two intermediate cases 
we have 

0 < A < 1 fractional correction rule 

1 < A < 2 absolute correction rule 

The reader may notice that the fractional correction rule, although it moves 
w / in the right direction, results in another misclassification during the repeated 
demonstration of the same training pattem, unless the weights have moved to the 
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appropriate side of its decision plane due to the other corrections that might have 
taken place in the meantime. For the fixed correction rule introduced originally 
with c = const, correction of weights is always the same fixed portion of the 
current training vector y. Let us also note that the training algorithm based on the 
computation of distance ||y|| requires that the initial weight vector be different 
from 0. This is needed to make the denominator of (3.31) of nonzero value. 


EXAMPLE 3.3 ■“"■"""■"■■i™ 

Let us look at a simple, but nonetheless, instructive example of nonparamet- 
ric training of a discrete perceptron. The trained classifier should provide the 
following classification of four pattems x with known class membership d 

x x = 1, x 3 = 3, di = d 3 ~ 1 : class 1 
x 2 = -0.5, x 4 = —2, d 2 = d 4 = —1 : class 2 

The single discrete perceptron with unknown weights w x and w 2 as shown 
in Figure 3.14(a) needs to be trained using the augmented input vectors as 
below 



Decision lines w'y, = 0, for i = 1, 2, 3, 4, have been sketched in the 
augmented weight Space as illustrated in Figure 3.14(b), along with their cor- 
responding normal vectors, which are simply pattems y h for / = 1, 2, 3, 4. 
The shaded area in the weight plane represents the anticipated solution 
region for the given supervised training task. The shaded area of weight 
solutions is known to us based on the similar considerations as in Figure 
3.11. The classifier weights, however, must yet be trained to relocate possibly 
to the shaded area in this example. 

Let us review the training with an arbitrary selection of c = 1, and 
with the initial weights chosen arbitrarily as w 1 = [-2.5 1.75]*. Using 

(3.26), the weight training with each step 1 through 10 and illustrated in 
Figure 3.14(b) can be summarized as follows 

Aw * = \ [ dk ~ s § n ( w %)] y k 

During the training we obtain the following outputs and weight updates. 
Step 1 Pattem is input 

= sgn([-2.5 1.75] | ) = -1 

di øj — 2 


Figure 3.14a, b 

Discrete perceptron classifier training in Example 3.3: (a) network diagram 

(b) fixed correci 

:ion rule training. 

Pat 

tem vector y 1 = y x is added to the present weight vector w l 


2 1 i i — 1.5 

w = w + y 1 = _ __ 

J L 2.75. 

Step 2 

Pattem y 2 is input 


0 2 = sgn — 1.5 2.75] ”^ 5 ) = 1 
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(3-5) (6-10) (11) (0) (2) (1) 

e X X O X ' X-B X-X B 

-2 -03 \ 0 1 3 *i 

-V, *2 \ X, X 3 

Final decision point pattem space 

(c) 

Figure 3.14c Discrete perceptron classifier training in Example 3.3 (continued): (c) decision 
surfaces. 


Pattem vector y 2 = y 2 is subtracted from the present weight vector w 2 


3 2 

W = W 


-1 

1.75 


Step 3 


Pattem y 3 is input 


o 3 = sgn 




= -1 


^3 °3 — % 

Pattem vector y 3 = y 3 is added to the present weight vector w 3 


w 4 - w 3 + y 3 


2 

2.75 


Since we have no evidence of correct classification for weights w 4 , the 
training set consisting of an ordered sequence of pattems y l9 y 2 , and 
y 3 needs to be recycled. We thus have y 4 = y l5 y 5 = y 2 , etc. The 
superscript is used to denote the following training step number. 
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Steps 4,5 w 6 = w 5 = w 4 . (No misclassification, thus no weight adjust- 
ment.) 

The reader can easily verify that the adjustments following in Steps 
6 tlirough 10 are as follows: 

w 7 = [2.5 1.75]' 

in Q 8 7 

W = w = w = w 
w 11 = [3 0.75]' 

As indicated, only Steps 6 and 10 have resulted in weight adjustment 
during continued training with recycled inputs. Since after ten training 
steps the weight vector w 11 ended up in the solution area, w x = 3, 
and w 2 = 0.75 represent the final weights of the classifier from Figure 
3.14(a) and the training terminales. The reader can verify that w 11 
represents the solution weight vector and provides correct classification 
for the entire training set. 

The training has been so far performed in the augmented weight space 
as showi in Figure 3.14(b). It should be realized, however, that the original 
decision surfaces used by the TLU element are generated in the pattem 
space. Figure 3.14(c) shows the decision surfaces produced for the example 
training for each of Steps 1 through 10. In the augmented pattem space y u 
y 2 , the decision surfaces are the straight lines shown; in the pattem space, 
the deci fion surfaces reduce to points on the x x axis. h 

We now simulate a fixed-correction training rule for a case of higher dimen- 
sion. An example with n = 3 and R = 2 is shown in Figure 3.15. As illustrated 
in the first eight rows of Figure 3.15(a), eight binary pattems spanned in three- 
dimensional space and representing one of the two classes are submitted to the 
input of the classifier in an ascending binary sequence. As listed in the rightmost 
column, four pattems located on the plane x 3 = 0, which is the bottom of the 
unity side cu be in E 3 space, belong to class 2; the remaining four pattems belong 
to class 1. In the calculations shown, we also assume that the perceptron can 
respond with zero. Such “no decisive response” is interpreted as a mistake and 
is folio wed t y a correction. 

Accordi igly, starting from the initial weight vector w = 0 as shown in 
the first row of iteration 1, the first corrected weight vector in the second row 
becomes the difference of the current weight vector [ 0 0 0 0 ] ' and the mis- 

classified pattem vector [0 0 0 1 ]'. The net values at the input of the TLU 

expressing scalar products w*y are shown in the rightmost column of the figure. 
As shown, the training takes only three full sweeps through the set of prototypes, 
since the pattems are apparently easily separable. The resulting decision plane 
shown in 3.15(c) is 4x 3 — 1=0. The equation has been obtained by inspection 
of the last row of the iteration 3 of Figure 3.15(a). 


3.5 TRAINING AND CLASSIFICATION USING THE DISCRETE PERCEPTRON: ALGORITHM AND EXAMPLE 


127 


Input data and classes 

x \ x 2 x 3 x 4 d 
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0 
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1 2 






0 

0 

1 

1 1 
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1 

0 

1 2 






0 

1 

1 

1 1 






1 

0 

0 

1 2 
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0 

1 

1 1 






1 

1 

0 

1 2 






1 

1 

1 

1 1 






Training Steps 

*1 

w 2 

w 3 

W A 

net = w f y 

Iteration 

1 — 

1 






O 

0 

0 

1 

0.00 

0.00 

0.00 

0.00 

0.00 

0 

0 

1 

1 

0.00 

0.00 

0.00 

- 1.00 

- 1.00 

0 

1 

0 

1 

0.00 

0.00 

1.00 

0.00 

0.00 

0 

1 

1 

1 

0.00 

- 1.00 

1.00 

- 1.00 

- 1.00 

1 

0 

0 

1 

0.00 

0.00 

2.00 

0.00 

0.00 

1 

0 

1 

1 

- 1.00 

0.00 

2.00 

- 1.00 

0.00 

1 

1 

0 

1 

0.00 

0.00 

3.00 

0.00 

0.00 

1 

1 

1 

1 

- 1.00 

- 1.00 

3.00 

- 1.00 

0.00 

Iteration 

1 = 

2 






0 

0 

0 

1 

0.00 

0.00 

4.00 

0.00 

0.00 

0 

0 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

0 

1 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

0 

1 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

1 

0 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

1 

0 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

1 

1 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

1 

1 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

Iteration 

1 = 

3 






O 

0 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

0 

0 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

0 

1 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

0 

1 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

1 

0 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

- 1.00 

1 

0 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 

1 

1 

0 

1 

0.00 

0.00 

4.00 

- 1.00 

-l.oo' 

1 

1 

1 

1 

0.00 

0.00 

4.00 

- 1.00 

3.00 


The training took 2 iterations. 

The final weights are: 0.00 0.00 4.00 -1.00. 



is pattern vector. 


*4 = 1 is augmented pattern component, d is desired class membership. 


(a) 

Figure 3 . 15 a Example dichotomizer training, n = 3: (a) fixed correction rule (c = 1). 
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1 

1 
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0 
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0.00 
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0 

1 

1 

1 
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0 
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1.13 
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1 

0 
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1 


0.00 

0.00 
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-0.38 

0.75 

1 

1 
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0.00 

0.00 
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1 

1 

1 

1 


0.00 

0.00 
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The training took 1 iteration. 


The final weights are: 0.00 0.00 1.13 -0.38. 



X 4 = 1 is augmented pattem component, d is desired class membership. 

(b) 

Figure 3.15b Example dichotomizer training, n = 3 (continued): (b) absolute correction rule. 

Similar results, but produced within only two sweeps through the training 
set, have been obtained for the absolute correction rule. The resulting plane is 
1.13*3 — 0.38 = 0 for the weight computation using this rule. The simulated 
training is shown in Figure 3.15(b), and the resulting decision plane is shown in 
Figure 3.15(c). 
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Figure 3.15c Example dicholomizer training, n = 3 (continued): (c) decision planes for parts 
(a) and (b). 


Let us summarize the classification of linearly separable pattems belonging 
to two classes only. The training task for the classifier was to find the weight 
vector w such that 


(3.32) 


w'y > 0 for each x of åd, 
w'y < 0 for each x of 

Completion of the training with the fixed correction training rule for any ini- 
tial weight vector and any correction increment constant leads to the following 
weights 


Yy ■7’ = yy ? ' = yv ^ — W ■ 


(3.33) 


with w* as a solution vector for (3.32). The integer k Q is the training step number 
starting at which no more misclassification occurs, and thus no weight adjust- 
ments take place for k 0 ^ 0. This theorem, called the perceptron convergence 
theorem, States that a classifier for two linearly separable classes of pattems is 
always trainable in a finite number of training steps. Below we provide the proof 
of the theorem. 
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Assuine that the solution weight vector w* exists, and is normalized such 
that ||w*|| = 1. The existence of hypothetical solution weights w* allows rewrit- 
ing of (3.32) for a small arbitrarily selected constant 0 < 8 < 1 as follows 

w*'y > 8 > 0 for each x e 2C l 
w*'y < — 5 < 0 for each xeS 2 
Let us evali ate the following function: 

*t 

. , , A W 'W 


Since <£( w) is a scalar product of normalized vectors, </>(w) = cos <(w*, w) ^ 1. 
Rearranging the numerator and denominator of (3.35) we obtain for the fe’th 
leaming step, respectively, 


*t k + 1 

w w 


w*'w* + w*'y > w* r w* + 8 


= (w fa + y')(w* + y) < ||w Æ || 2 + 1 


(3.36a) 

(3.36b) 


for the normalized pattem used for training and for c = 1. 

Upon performing the total of k 0 training steps (3.36a) and (3.36b) can be 
rewritten, respectively, as 

W *V* +1 > k 0 8 (3.36c) 


This condition can be rewritten as follows: 

|| w 4 + l ||2 < ^ 

The function (3.35) now becomes for w = w k 




(3.36d) 


(3.37a) 


and from (3.36c) and (3.36d) note that 

<t>{ w to+1 ) > y/f 0 8 (3.37b) 

Since <f>(w kv +1 ) has an upper bound of 1, the inequality (3.37b) would be violated 
if a solution were not found for \/ j/ k a 8 < 1. Thus, this concludes the proof that 
the discrete perceptron training is convergent. The maximum number of leaming 
steps is k 0 == 1/S 2 (Amit 1989). If the value 1/S 2 is not an integer, it should be 
rounded up to the closest integer. 

The number of steps k 0 needed to achieve correct classification depends 
strongly on the c value and on the sequence of training pattems submitted. The 
theorem can also be extended and proven for R category machines classifying 
pattems that are linearly separable and trained through submission of such pat- 
tems in any sequence (Nilsson 1965). 

The perceptron training procedure expressed in (3.26) may be rewritten as 

W *+1 = w * + £(</* _ </)y* (3.38) 
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where k denotes the number of the current training step, o k is the actual output 
of the classifier being trained, and d k is the teacher-submitted desired output for 
the vector y k applied at the input. When d k = o k , no adjustment of weights is 
undertaken. When d k — 1 and o k = — 1, signifying jointly that the pattem of 
class 1 was mistakenly not detected, the weight correction becomes 

w* +1 -w k = cy k (3.39a) 

Whenever a pattem of class 2 is not detected, and the classification is resulting 
in o k = 1 while d k — — 1, the weight correction is 

w* +1 - w * = -cy k (3.39b) 

In summary, the training of a single discrete perceptron two-class classifier 
requires a change of w if and only if a misclassification occurs. If the reason 
for misclassification is Wy < 0, then all weights w t are increased in proportion 
to yp if w f y > 0, then all weights w t are decreased in proportion to y t . The 
weights are then punished in an appropriate way. If the classification is correct, 
no punishment takes place, but its absence can be taken as a reward. We can 
formulate the single discrete perceptron training algorithm as given below. 

The error value E used in the algorithm is computed to detect the series of 
correct classifications within the training cycle and to terminate the training. Left- 
pointing arrows below indicate the operation of an assignment. By the operation 
of assignment, the argument on the left side is assigned the value indicated on 
the right side. 


mm Summary of the Single Discrete Perceptron Training Algorithm (SDPTA) 
Given are P training pairs 

{x l ,d l ,x 2 ,d 2 ,. . . y Xp 9 dp} 9 where 
x / - is ( n X 1), di is (1 X 1), / = 1,2,... P. 


Note that augmented input vectors are used: 


y *• = 


for i = 1, 2, . . . , P 


In the following, k denotes the training step and p denotes the step 
counter within the training cycle. 


Step 1 : c > 0 is chosen. 

Step 2: Weights are initialized at w at small random values, w is 
(n + 1) X 1. Counters and error are initialized: 

k 1, p <— 1, E <— 0 


Step 3: The training cycle begins here. Input is presented and output 
computed: 


y <- y p,d<-d p 

o sgn (w f y) 
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Step 4: Weights are updated: 

w w + }jc(d — o) y 
Step 5: Cycle error is computed: 

E^ l -{d-of + E 

Step 6: If p < P then p *— p + \, k *— k + \, and go to Step 3; 
ott erwise, go to Step 7. 

Step 7 : The training cycle is completed. For E = 0, terminate the 
training session. Output weights and k. 

If E > 0, then E *— 0, p <— 1, and enter the new training 
cycle by going to Step 3. 


The training rule expressed by the Equations (3.26) and (3.38) will be ex- 
tended in the next section to cases where the TLU or the threshold device is 
no longer the most favored decision computing element. Instead, a continuous 
perceptron element with sigmoidal activation function will be introduced to facil- 
itate the training of multilayer feedforward networks used for classification and 
recognition. 

3.6 ■ 

SINGUE-LAYER CONTINUOUS 

PERCE PTRON NETWORKS 
FOR LINEARLY SEPARABLE 
CLASSIFICATIONS 

In this section we introduce the concept of an error function in multidi- 
mensional weight space. Also, the TLU element with weights will be replaced 
by the continuous perceptron. This replacement has two main direct objectives. 
The first om is to gain finer control over the training procedure. The other is 
to facilitate working with differentiable characteristics of the threshold element, 
thus enabling computation of the error gradient. Such a continuous characteristic, 
or sigmoidal activation function, describes the neuron in lieu of the sgn function 
as discussed in Chapter 2. 

Accord ng to the training theorem discussed in the last section, the TLU 
weights are converging to some solution w k ° for any positive correction increment 
constant. The weight modification problem could be better solved, however, by 
minimizing the scalar criterion function. Such a task can possibly be attempted 
by again using the gradient, or steepest descent, procedure. 
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Figure 3.16 Continuous perceptron training using square error minimization. 


The basie procedure of descent is quite simple. Starting from an arbitrary 
chosen weight vector w, the gradient V£(w) of the eurrent error function is 
computed. The next value of w is obtained by moving in the direction of the 
negative gradient along the multidimensional error surface. The direction of neg- 
ative gradient is the one of steepest descent. The algorithm can be summarized 
as below: 

w t+ 1 = w* - rjVE(vf k ) (3.40) 


where 17 is the positive constant called the learning constant and the superscript 
denotes the step number. 

Let us define the error E k in the k ’ th training step as the squared difference 
between the desired value d k at the output of the continuous perceptron and its 
actual output value o k computed. As shown in Figure 3.16, the desired value d k 
is provided by the teacher. 

The expression for classification error to be minimized is 


E k = l -{d k - o k )\ or 



(3.41a) 

(3.41b) 


where the coefficient 1 / 2 in front of the error expression is intended for conve- 
nience in simplifying the expression of the gradient value, and it does not affect 
the location of the error minimum or the error minimization itself. 

Our intention is to achieve minimization of the error function E( w) in (n+ 1)- 
dimensional weight Space. An example of a well-behaving error function is shown 
in Figure 3.17. The error function has a single minimum at w = Wy, which can 
be achieved using the negative-gradient descent starting at the initial weight 
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Figure 3.17 Minimization of the error function: (a) error surface (n = 1) and (b) error contours 
for determination of the gradient vector VE. 
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vector 0 shown in the figure. The vicinity of the point E(y? f ) = 0 is shown to 
be reached within a finite number of steps. The movement of the weight vector 
can be observed both on the error surface in Figure 3.17(a), or across the error 
contour lines shown in Figure 3.17(b). In the depicted case, the weights displace 
from contour 1 through 4 ideally toward the point £ = 0. By definition of the 
steepest descent concept, each elementary move should be perpendicular to the 
current error contour. 

Unfortunately, neither error functions for a TLU-based dichotomizer nor 
those for more complex multiclass classifiers using TLU units result in a form 
suitable for the gradient descent-based training. The reason for this is that the 
error function has zero slope in the entire weight space where the error function 
exists, and it is nondifferentiable on the decision surface itself. This property 
directly results from calculating any gradient vector component VE( w) using 
(3.41) with sgn (net) replacing f(net). Indeed, the derivative of the intemal func- 
tion sgn(-) is nonexistent in zero, and of zero value elsewhere. 

The error minimization algorithm (3.40) requires computation of the gradient 
of the error (3.41) as follows: 

V£(w) = ~V [d-f(net)] 2 (3.42) 

The training step superscript k in (3.42) has been temporarily skipped for simplic- 
ity, but it should be understood that the error gradient computation refers strictly 
to the £’th training step. In the remaining part of this section, the gradient descent 
training rule for the single continuous perceptron will be derived in detail. 

The n 4- 1 -dimensional gradient vector (3.43) is defined as follows: 


V£(w) = 


dE 

d\V x 

dE 

dw 2 


dE 

dWn+l 


Using (3.42) we obtain for the gradient vector 


d(net) 
dw x 
d(net ) 
dw 2 

d(net) 

dw«+l 


(3.43) 


V£(w) = ~(d - o)f'(net ) 


(3.44) 
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Since net = 


|w'y we have 
d(net) 


and (3.44) c^n be rewritten as 

V£(w) = ~(d - o)f(net) y 


or 


which is the 
rule is equiv 
weights as i 
activation fu 
Note 
we will no Id 
as a variable 
any other coi 
Let us 
bipolar conti 


thai 


we obtain 


dw ( 


= y h for i = 1 , 2, . . . , n + 1 


dE 

dw ( 


= ~(d - o)f , (net)y i 


(3.45) 

(3.46a) 

(3.46b) 


training rule of the continuous perceptron. It can be seen that the 
alent to the delta training rule (2.38). The computation of adjusted 
i (3.40) requires the assumption of rj and the specification for the 
netion used. 

t in further considerations we assume trainable weights. Therefore, 
nger need to use the steepness coefficient A of the activation funetion 
. The assumption that A = 1 is thus as valid as the assumption of 
nstant A value used to scale all the weights in the same proportion, 
express f'(net) in terms of continuous perceptron output. Using the 
inuous activation funetion f{net) of the form 

f(net ) = - , - 1 (3.47) 


finet) 


1 4* exp(— net) 


_ 2exp(— net) 


[1 + exp(- 

A useful idebtity (3.49) can be applied here: 


-net)] 2 


2exp (—net) 


5 ( i -° 2) 


(3.48) 


(3.49) 


[1 + exp (-net)] 2 

The identity (3.49) is verified below. Letting o = f (net) as in (3.47) on the right 
side of (3.49) leads to 


- exp(-fleOV 

+ exp (—net) J 

The right siile of (3.50) reduces after rearrangements to 




-G 


1 

2 

iplletes 


1 


G 


exp(— net)\ 2 


4* exp (—net) J 
the verification of (3.49). 


2 exp (—net) 

[1 + exp (—net)] 2 


which com] 

The gradient (3.46a) becomes 


1 


V£(w) = - -(d - o)( 1 


o 2 )y 


(3.50) 


(3.51) 


(3.52) 
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and the complete delta training rule for the bipolar continuous activation function 
results from (3.40) as 

w* +i = w* + - o k )( 1 - o k2 )y k (3.53) 

where k denotes the reinstated number of the training step. 

It may be noted that the weight adjustment rule (3.53) corrects the weights in 
the same direction as the discrete perceptron leaming rule originally formulated 
in (3.38). The size, or simply length, of the weight correction vector is the main 
difference between the rules of Equations (3.38) and (3.53). Both these rules 
involve adding or subtracting a fraction of the pattem vector y. The essential 
difference is the presence of the moderating factor 1 — o k2 . This scaling factor is 
obviously always positive and smaller than 1. For erroneous responses and net 
close to 0, or a weakly committed perceptron, the correction scaling factor will 
be larger than for those responses generated by a net of large magnitude. 

Another significant difference between the discrete and continuous percep- 
tron training is that the discrete perceptron training algorithm always leads to a 
solution for linearly separable problems. In contrast to this property, the negative 
gradient-based training does not guarantee solutions for linearly separable pattems 
(Wittner arid Denker 1988). 

At this point the TLU-based classifier has been modified and consists of a 
continuous perceptron element. In the past, the term perceptron has been used to 
describe the TLU-based discrete decision element with synaptic weights and a 
summing node. Here, we have extended the perceptron concept by replacing the 
TLU decision element with a neuron characterized by a continuous activation 
function. As in the case of a discrete perceptron, the pattem components arrive 
through synaptic weight connections yielding the net signal. The net signal excites 
the neuron, which responds according to its continuous activation function. 


EXAMPLE 3.4 


This example revisits the training of Example 3.3, but it discusses a con- 
tinuous perceptron rather than a discrete perceptron classifier. The training 
pattem set is identical to that of Example 3.3. The example demonstrates 
supervised training using the delta rule for a continuous perceptron, which is 
shown in Figure 3.16. As has been shown, the training rule (3.53) minimizes 
the error (3.41) in each step by following the negative gradient direction. 
The error value in step k is 



2 

1 4- exp(— Xnet k ) 



} 


2 
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where net k = w kt y k . Note that for fixed pattems y* = y*, for k = 1, 2, 
3, 4, the error can be expressed as a function of two weights only. Thus, 
we can easily track the error minimization procedure during training in the 
two-dimensional weight space. The error function in the first training step 
can be obtained using numerical values of Example 3.3 plugged into the 
error expression above. For the first pattem being input we have 


£i(w) 


-H-[r 


2 l LI + exp[-\(w x + w 2 )] JJ 

Plugging A — 1 and reducing the terms simplifies the above expression 
for error to the form as below 

E 2 
lW [1 + expOj + w 2 )] 2 

for the first training step. Subsequent Steps 2, 3, and 4 need to reduce the 
training step error values, respectively, as follows: 

£ 2(W) = M T _, n 2 g „, —TU 


E,( W) = 


z [1 + exp(0.5w! ~ w 2 )] 2 

^ 3 ( w ) jj + exp(3w, + w 2 )] 2 

[1 + exp (2 w x - w 2 )] 2 

Duiing training, the error E k in Æ’th training step is minimized as a 
function of weights at w = w\ The error surfaces E x through E 4 are shown 
in Figure 3.18. It can be seen from the figure that each training step error is 
monoton ically decreasing and it asymptotically reaches 0 at decision regions 
yielding correct classification. 

Although the error surfaces clearly visualize the anticipated displace- 
ment of w* toward the lower error values using the steepest descent tech- 
nique, reconstruction of the trajectories of trained weights for the sequence 
of steps is somewhat complex. Adding eirors E x through E 4 provides better 


overall ipsight into the error minimization process. The total error surface 
is shown in Figure 3.19(a). The total error monotonically decreases and 
reaches values near 0 in the shaded region of Figure 3.14(b). A contour 
map of l:he total error is depicted in Figure 3.19(b). 

The classifier training has been simulated for rj — 0.5 for four arbitrar- 
ily chosen initial weight vectors, including the one taken from Example 3.3. 
The resulting trajectories of 150 simulated training steps are shown in Figure 
3.19(c) (each tenth step is shown). In each case the weights converge during 
training toward the center of the solution region obtained for the discrete 
perceptron case from Figure 3.14(b). ■■ 




(C) (d) 

Figure 3.18 Error functions for individual pattems, Example 3.4: (a) first pattern, E h (b) second 
pattern, E 2 , (c) third pattern, and E 3 , and (d) fourth pattem, E 4 . 
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We can formulate the single continuous perceptron training algorithm as 
given below. 

mm Summaiy o[ the Single Continuous Perceptron Training Algorithm (SCPTA) 

Given the same data as for SDPTA in Section 3.5: 

Step 1 : 17 > 0, A = 1, £ max > 0 chosen. 

Step 2: Same as SDPTA in Section 3.5. 

Step 3: The training cycle begins here. Input y is presented and 
output computed: 

y«-y P’d^-d p 

o <— /(w'y), with f(net) as in (2.3a) 

Step 4: Weights are updated: 

1 , 
w <— w + 2 T K < ^ - o)(l - o )y 
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Steps 5 6: Same as SDPTA in Section 3.5 

Step 7 : The training cycle is completed. For E < £ max terminate the 
training session. Output weights, k and E. 

If £ > £ max , then E <— 0, p *— 1, and enter the new training cycle 
by going to Step 3. 


3.7 

MULTICATEGORY 
SINGLE-LAYER 
PERCEPTRON NETWORKS 

Our approach so far has been limited to training of dichotomizers using 
both discrete and continuous perceptron elements. In this section we will attempt 
to apply the e rror-correcting algorithm to the task of multicategory classification. 
The assumption needed is that classes are linearly pairwise separable, or that each 
class is linea *ly separable from each other class. This assumption is equivalent 
to the faet th.it there exist R linear discriminant funetions such that 

gi(x) > gj(x), for /, j = 1 , 2, i * j 

Let us devise a suitable training procedure for such an Æ-category classifier. To 
begin, we need to define the augmented weight vector as 

W 9 = [w q l W q 2 ... W 9> „ +1 ]' 

Assume that an augmented pattem y of class i is presented to the maximum 
selector-based classifier as in Figure 3.7. The R decision funetions w^y, w^y, . . . , 
w^y are evaluated. If w|y is larger than any of the remaining R — 1 discriminant 
funetions, no adjustment of weight vectors is needed, since the classification is 
correct. This indicates that 

wi - Wj 


W2 = W 2 

W* = W* 


(3.54a) 


The prime is again used to denote the weights after correction. If, however, for 
some m value we have w|y < w^y, then the updated weight vectors become 

w; = w,- + cy 

K, = w m - cy (3.54b) 

w* = w*, for k = 1, 2, . . . , /?., k =£ i, m 
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Figure 3.20 The biased neuron: (a) simplified diagram and (b) neuron’s response for nonzero 
threshold. 


The matrix formula (3.54b) can be rewritten using the double-subscript notation 
for the weights used in Figure 3.7 as follows: 

w\j = w tj + cyj for j = 1, 2, . . . , n + 1 

w mj = w mj - tyj for j = 1, 2, . . . , /i + 1 (3.54c) 

w kj = w kj for k = 1, 2, . . . , R, k + i, m, j = 1, 2, . . . , n + 1 

It can be seen from Eq. (3.54c) that the weight value of the connection between 
the i’th output and the / th component of the input is supplemented with cyj 
if the fth output is too small. The weight value toward the m’th output from the 
/th component of the input is reduced by cyj if the m’th output is excessive. 

So far in this chapter we have been dealing with the augmented pattem 
vector y with the ( n 4* l)’th component always of fixed value +1. It is somewhat 
instructive to take a different look at this fixed component of the pattem vector 
to gain better insight into the perceptron’s thresholding operation. Figure 3.20(a) 
shows the perceptron-based dichotomizer in which the augmented pattem com- 
ponent is y n+ i = — 1. Compared with the diagram from Figure 3.16, the weight 
values w n+l on both figures have to be of opposite values to yield identical 
decision functions. Since the weights are iteratively chosen during the training 
process anyway, it is not relevant at all which of the two perceptron diagrams 
is adopted for the weight training. It becomes thus irrelevant whether y n+l is 
assumed of value +1 or —1. 

Let us assume now that the neuron is excited as in Figure 3.20(a). We can 
now write that 


net = w'x - w n+l 


(3.55) 
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Denoting w n3l as the neuron activation threshold, or bias, value equals T 


it becomes obvious that the output of the neuron can be expressed as 


f(net) 


= { >0 
l<0 


for w'x > T 
for w'x < T 


The activatio 
argument is s 
a neuron witi 
mented activ* 
value is equa 
of the origina 

Otherwis 
the neuron to 
to stress agai 
appropriate cl 
the actual firi 
In further cor 
stated. 

So far o 
the dichotomi 
useful in bui 
distance class 
is needed. 

Let us nc 
fier as in Figi 
generated thh 
3.21. The pro 
when ^(x) h 
i = 2, 3, , 

by the single- 
Figure 3.21 n 
should indica 

The adji 
of w l n+l thr< 
clearly feasib 
in < S£ l . The ac 
at the input c 
during the T , 
to the remain 
using R indiv 


ti function expressed now in the form of /( w'x) with w'x as an 
tiown in Figure 3.20(b). The figure shows an activation function of 
i the positive threshold value, T > 0, and sketched versus nonaug- 
ition w'x. It is instructive to notice that now the ( n + l)’th weight 
to the threshold 7\ and the neuron is excited if the weighted sum 
1, unaugmented pattem exceeds the threshold value T. 
e, the neuron remains inhibited. The nonzero threshold value causes 
behave as a biased device with T being its bias level. It is important 
n that from the training viewpoint, any value y n+l = const is an 
loice. When y rt+1 = — 1, however, the w n+l value becomes equal to 
ag threshold of the neuron with input being the original pattem x. 
siderations we will use y n+l = -1 and w n+l — T unless otherwise 

ur discussion has focused on a continuous perceptron performing 
zer-like function. We now show that the continuous perceptron is 
ding any linear classifier. We begin by modifying the minimum 
ifier so that it uses discrete perceptrons and no maximum selector 

>w try to eliminate the maximum selector in an R- class linear classi- 
ire 3.7 and replace it first with R discrete perceptrons. The network 
» way is comprised of R discrete perceptrons as shown in Figure 
perly trained classifier from Figure 3.7 should respond with i 0 = 1 
larger than any of the remaining discriminant functions g/(x), for 
/?, for all pattems belonging to Instead of signaling the event 
output maximum selector responding with i 0 = 1, a TLU #1 as in 
lay be used. Then, outputs o l = 1 and o 2 = o 3 = . . . = o R = —1 
te category 1 input. 

istment of threshold can be suitably done by altering the value 
>ugh adding additional threshold value T { , Such an adjustment is 
le since ^(x) exceeds all remaining discriminant values for all x 
Ijustment is done by changing only the single weight value w 1/z+1 
f the first TLU by T x . None of the remaining weights is affected 
setting step with the input of class 1. Applying a similar procedure 
ing inputs of the maximum selector from Figure 3.7, the classifier 
idual TLU elements as shown in Figure 3.21 can be obtained. For 
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this classifier, the Uth TLU response of + 1 is indicative of class k and all other 
TLUs respond with -1. 

The training procedure outlined in (3.54) uses the maximum selector net- 
work which will be covered in Chapter 7 in more detail. For our neural network 
studies however, architectures with the TLU elements are much more important 
than those with maximum selectors. As shown in the previous paragraphs, the 
approach can be used to devise the multicategory perceptron-based classifier by 
replacing the maximum selector through R threshold logic units. Altematively, 
direct supervised training of the network of Figure 3.21 can be performed. The 
weight adjustment during the k’th step for this network is as follows: 

W* +1 = wf + c —(df - o\) y\ for / = 1 , 2 ,...,/? (3.58) 

where d t and o t are the desired and actual responses, respectively, of the fth dis- 
crete perceptron. The formula (3.58) expresses the R-category discrete perceptron 
classifier training. For Æ-category classifiers with so-called local representation , 
the desired response for the training pattem of the fth category is 

4 = 1, dj = - 1 , for 7 = 1,2, j±i (3.59) 

For /^-category classifiers with so-called distributed representation , condition 
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(3.59) is not mquired, because as more than a single neuron is allowed to respond 
4- 1 in this mc 
We can 
given below. 

TLU elements! 


qde. 

fjormulate the Æ-eategory discrete perceptron training algorithm as 
Local representation is assumed, thus indicating that R individual 
are used for this Æ-category classifier. 


Summary 


of the R-Category Discrete Perceptron Training Algorithm (RDPTA) 


Give|n are P training pairs 

{xi,d 1 ,x 2 ,d 2 , . . . ,Xj»,dp}, 

lis (n X 1), d f is (R X 1), / = 1,2,..., P. Note that augmented 


where x t 
input vectors are used: 

y i = 


-i 


for i = 1, 2, . . . , P 


and k denotes the training step; p denotes the step counter within the 
training cycle. 


Step 1 : 
Step 2: 

Ky] 


Same as SDPTA in Section 3.5. 

Weights are initialized at W at small random values, W = 
is R X (n + 1). 


k <— 1, p *— 1, E <— 0 

Step 3: The training cycle begins here. Input is presented and output 
computed: 

y «- y P , d d p 

o i <- sgn(w'y), for i = 1, 2, . . . , R 

wheife w, is the i’th row of W 
Weights are updated: 


Step 4: 


Step 5: 


Steps 6, 


w,- <- w, + ^c(dj - 0 ;)y, for i = 1, 2, . . . , R 
Cycle error is computed: 

1 i 

E *— -(d t — Oj) 2 + E, for i = 1, 2, . . . , R 
7: Same as SDPTA in Section 3.5. 


The exanliple below demonstrates how a multicategory classifier using a 
maximum selector is converted so that it uses only discrete perceptrons. Also, the 
multicategory training procedure for a multi-perceptron classifier is illustrated. 
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EXAMPLE 3.5 


Let us revisit the three-class classifier design problem of Example 3.2. We 
will modify the classifier from Figure 3.8(d) so that it uses three discrete 
perceptrons. This will require changing the threshold values of each unit as 
described earlier in this section. 

It is easy to verify from (3.20a) that the discriminant values ^(x), g 2 (x), 
and g 3 (x) for each of the three prototype pattems are as shown below: 


Discriminant 


Input 


Class 1 
[10 2]' 

Class 2 
[2 -5]' 

Class 3 
[-5 5]' 

£i(x) 

52 

-42 

-92 

#2(x) 

-4.5 

14.5 

-49.5 


-65 

-60 

25 


As required by the definition of the discriminant function, the responses 
on the diagonal are the largest in each column. Inspection of the discrim- 
inant functions (3.20b) yields the thresholds w l 3 , w 2?3 , and w 3 3 of values 
52, 14.5, and 25, respectively. Let us choose additional threshold T { = —2. 
This lowers the weight w lj3 from 52 to 50. This will ensure that the TLU 
#1 only will be responding with + 1 for class 1 input. Choosing similarly T 2 
and T 3 values of —2 allows for TLU #2 and TLU #3 to respond properly. 
The resulting three-perceptron classifier is shown in Figure 3.22(a). 

Instead of using classifiers which are designed based on the minimum- 
distance classification for known prototypes and deriving the network with 
three perceptrons from the form of the discriminant functions, our objective 
in the following part of the example is to train the network of three percep- 
trons using expression (3.58). Randomly chosen initial weight vectors are 


w}- 

' r 
-2 

. w* = 

' 0' 
-1 

, w’ = 

' r 

3 

0 


2 


-1 


Assuming that augmented pattems are presented in the sequence y x , y 2 , 
y 3 , y l9 y 2 , ..., and local representation condition (3.59) holds, we obtain 
(incorrect answers and thus requiring corrections are marked by asterisks) 
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Step 1 : y] is input: 



r io i\ 

1-2 0] 2 j = i 

’ 10 1 \ 

0-1 2 ] 2 J = -1 



r io i\ 

3-1] 2 I = 1* 


Since the only incorrect response is provided by TLU #3, we have for 
c = 1: 





The weight updates are: 



Step 3: y 3 is input: 

sgn(wfy 3 ) = 1* 
sgn(w^y 3 ) = -1 
sgn(Wj'y 3 ) = 1 

The weight updates are: 
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This terminates the first leaming cycle and the third step of weight adjust- 
ment. The reader may verify easily that the only adjusted weights from now 
on are those of the first perceptron. The outcome of the subsequent training 
steps is: 



This outcome terminates the training procedure since none of the last 
three patlem presentations require the adjustment of weights. The three- 
perceptron network obtained as a result of the training is shown in Figure 
3.22(b). It performs the following classification: 

o x = sgnCSjtj + 3x 2 — 5) 

0 2 = sgn (— *2 “ 2 ) 

0 3 = sgn(— 9xj + x 2 ) 

The resulting decision surfaces are shown in Figure 3.22(c). We may notice 
that the tiree-perceptron classifier produces, in faet, triple dichotomization 
of the plane x x , x 2 . Three decision surfaces produced, which are lines in 
this case, are 

5x x + 3x 2 — 5 = 0 
—x 2 — 2 = 0 
9x x 4- x 2 = 0 

The lines are shown in Figure 3.22(c) along with their corresponding nor- 
mal vectors directed toward their positive sides. Note that in contrast to 
the mininfium-distance classifier, this method has produced several indeci- 
sion regions. Indecision regions are regions where no class membership of 
an input gattern can be uniquely determined based on the response of the 
classifier. Pattems in shaded areas are not assigned any reasonable classifi- 
cation. Ohe of such pattems may be Q, arbitrarily selected as an example 
input. The reader can easily verify that the classifier ’s output for input Q 
is o = [ li 1 —iy.lt thus indicates an indecisive response. However, no 
pattems such as Q have been used for training in the example. wm 
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The gradient descent training rule derived for R = 2 is also applicable to the 
single-layer continuous perceptron network as shown in Figure 3.23. The figure 
displays the R class perceptron classifier derived from the TLU-based classifier 
in Figure 3.21. We can now produce the generalized gradient descent training 
rule for a single-layer network of any size. To do this, we combine training rule 
(3.53) with (3.58) as follows: 

w* +l = wf + Itj (</* - of)( 1 - øfV for /' = 1,2,...,/? (3.60) 

The training rule of Eq. (3.60) is equivalent to individual weight adjustment 



for j = 1, 2, . . . , n + 1, and / = 1, 2, . . . , R 

Expressions (3.60) and (3.61) exemplify the delta training rule for a bipolar 
continuous activation function as in (2.3a). Since the formula is of paramount 
importance for the multilayer perceptron network training, it will be covered in 
the next chapter in more detail. At this point we will only mention that the delta 
training rule can be used both for networks with single-layer and for multilayer 
networks, pro vided that continuous perceptrons are used. 

Let us finish with the general diagram of a supervised leaming system as 
shown in Figure 3.24. The discussed training mode can be represented in block 
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d (Teacher) 


Figure 3.24 Supervised learning adaptive mapping system. 

diagram forrp as illustrated. Both TLU-based and continuous perceptron classi- 
fiers discussed in this chapter fali into the diagram of Figure 3.24. The single 
recall step of the system maps the set in the input Space y onto sets in the output 
space o . The error function E(d, o) is being used as a criterion to accomplish the 
supervised learning. Although the definition of error function E is not limited 
to any partier lar form, it is widely accepted that is should be dependent on the 
difference vec tor d — o. The error function has been used to modify the intemal 
parameters (c illed weights) of the adaptive mapping system. 


3.8 


CONCLUDING REMARKS 


The exposition in this chapter has been organized around the assumptions 
of traditional gattern recognition principles. Although the coverage does not fully 
take advantage of the potential that is manifested by more complex neural net- 
work architectures, it provides a good overall base for the discussion of basic 
neural network: concepts. A list of such concepts includes classification, trainable 
classifiers, design of dichotomizers and of R category classifiers, linear separa- 
bility of pattems, perceptron training theorem, negative gradient descent-based 
training, and others. 

In this cjiapter we have reviewed basic concepts of deterministic pattem 
classification. Trainable classifiers have been discussed that use gradient descent 
minimization of the error between the desired and actual response. The gra- 
dient descent-based supervised training rules for single-layer neural classifiers 
have been deiived. Both discrete and continuous perceptron single-layer neural 
classifiers have been discussed and have been designed based on the weight 
space considerations. The training procedures have been illustrated using the 
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geometrical visualization in the weight space. It has also been shown that re- 
placing the sgn (•) function of the TLU element in a discrete perceptron with the 
continuous activation function modifies the leaming by the factor f\net). 

Material covered in this chapter has been largely based on Nilsson (1965) 
and other sources, such as Tou and Gonzalez (1974), Sklansky and Wassel (1981), 
and Young and Calvert (1974). The chapter covers the results of early efforts 
in machine leaming, and it emphasizes the perceptron model of leaming. The 
mathematical concepts that resulted from development of perceptrons continue 
to play a central role in most feedforward neural network leaming and recall. 
Intuitive geometric explanations and mathematical foundations presented in this 
chapter can be viewed as fundamental for multilayer neural network leaming. 

To bridge the traditional leaming concept using discrete perceptrons in 
single-layer networks with multilayer networks, continuous perceptron network 
leaming has been introduced. By replacing the discontinuous threshold operation 
with a continuous activation function, it is possible to compute the mapping error 
gradient in multilayer feedforward networks. The multilayer network training al- 
gorithm has opened new areas of solutions and applications. It also opened new 
questions. They will all be introduced in the following chapter. 


PROBLEMS 


Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 


P3.1 Write down equations along with the constraint inequalities for the deci- 
sion planes subdividing the cube into four decision regions 3£ l9 . . . , $C 4 as 
shown in Figure P3.1. The pattem vectors’ x memberships in classes are: 


P3.2 


i 0 = 1 for x = [ 1 0 0 ] { , 

i Q = 2 for x = [ 1 0 1 ] 1 , 

i 0 = 3 for x = [ 0 1 0 ] 1 , 

i Q = 4 for x = [ 0 0 0 ] 1 , 


[i i or 
[i i ir 
[0 1 1 ]' 
[o o i] r 


Note that the faces, edgés, and vertices here belong to the pattem space. 
Whenever the decision surface intersects the edge of the cube, the inter- 
section point should halve that edge (see Appendix for appropriate form 
of plane equations). 


A commonly used method for classification is template matching. Tem- 
plates in this problem are unipolar binary vectors v 0 , v l9 v 8 , and v 9 
corresponding to the digit bit maps shown in Figure P3.2. These vectors 
are stored as weights within the network. Both template and input vectors 
contain 42 rows with a 0 entry for white pixels and a 1 entry for black 



Figure P3.1 (jaeometric interpretation for classification of cube vertices in three-dimensional 
space (class nutnbers in parentheses). 


pixels^ The measure of matching of an input x to a template v is the scalar 
produkt v'x. The product value Xj is proportional to the similarity of input 
and template. 

Since the templates have different numbers of black pixels, each Xj 
should be multiplied by a coefficient Wj inversely proportional to the num- 
ber of black pixels in each template to equalize measures of similarity. 
This simple classification procedure reduces now to the selection of the 
maximum of the four responses being normalized matching scores x-Wj 
and cømputing i Q denoting the number of the largest input. 

Compute and tabulate the normalized matching scores between each 
pair of the digits 0, 1, 8, 9 shown at the input in Figure P3.2. Inspect the 
matching scores and for each digit being input find 

(a) it$ nearest neighbor digit with the strongest overlap of the two vectors, 

(b) it$ furthest neighbor digit with the weakest overlap of the two vectors. 

Assurrie now that the input bit maps of digits are randomly distorted. 
Based on the matching scores calculated determine which is 

(c) thje most likely misclassification pair among the pairs of digits 

( d) the least likely misclassification pair among the pairs of digits. 
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Matching 

scores Normalized 



(a) 



Figure P3.2 Template matching-based digit recognition system for Problem P3.2: (a) classifier 
and (b) character bit maps used for classification. 

P3.3 For the minimum-distance (linear) dichotomizer, the weight and aug- 
mented pattem vectors are 



(a) Find the equation of the decision surface in the pattem space. 

(b) Find the equation of the decision surface in the augmented pattem 
space. 

(c) Compute the new solution weight vector if the two class prototype 
points are 

Xi = [2 5]* and x 2 = [ — 1 -3] / . 

(d) Sketch the decision surfaces for each case in parts (a), (b), and (c). 

P3.4 Compute the solution weight vectors w 1? w 2 , w 3 , and w 4 needed for the 
linear machine to classify pattems as indicated in Figure P3.1 and using 
the decision planes shown. The decision plane equations should be de- 
veloped in this problem based on inspection of conditions in the pattem 
space, not on the iterative training procedure. 
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P3.5 Sketch the example discriminant functions gj(x) and g 2 (x) as in Example 
3.1 far the dichotomizer with the folio wing augmented weight and input 
vectors, respectively: 



knowing that the pattem y = [0 0 1 belongs to class 1. 

P3.6 A dicliotomizer has to be trained to assign Jt = 0 and x = 2 to class 
1 and 2, respectively. Display the movement of the weight vector on 
the weight plane starting from the initial weights of [ 1 1 ] 1 values and 

folloW intermediate steps until weights fail to the solution region 

(a) lise c = 1. 

(b) Use A = 1, as in (3.31). 

(c) Use A = 2, as in (3.31). 

(Hint: In the case of A = 1, adjusted weights fail on the decision plane. 
This can also happen for A ^ 1 in other training modes. Thus, to achieve 
the correct classification, the final adjustment must move weights from the 
decision plane into the decision region.) 

P37 Prototype points are given as 

x, = [5 1]',x 2 = [7 3]',x 3 = [3 2]',x 4 = [5 4]': class 1 
x 5 = [0 0 ]‘, x 6 = [-1 -3 ]',x 7 = [-2 3]',x 8 = [-3 0] r : class 2 

(a) Determine if the two classes of pattems are linearly separable. 

(b) Determine the center of gravity for pattems of each class, and find 
apd draw the decision surface in pattem space. 

(c) Using (3.9) or (3.11) design the dichotomizer for the given prototype 
points and determine how it would recognize the folio wing input 
pattems of unknown class membership: 

X= [4 2]', x= [0 5]'. x = [H o] 

P3.8 Class prototype vectors are known as 

T 2 1 

Xj = [— 2] , x 2 = — ~ , x 3 = [3] : class 1 

x 4 = [ 1 ] , x 5 = [ 2 ] : class 2 

(a) Draw pattems in the augmented pattem space. 

(b) Draw separating lines in the augmented weight space for each pattem. 
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(c) Find the set of weights for the linear dichotomizer, or conclude 
that this is not a linearly separable classification problem and go 
to part (d). 

(d) Design the dichotomizer using a single discrete perceptron and non- 
linear discriminant function of quadratic type. 


[Hint: g(x) needs to be chosen as a quadratic function of x.] 

P3.9* Implement the Single Discrete Perceptron Training Algorithm for c = 1 
for the discrete perceptron dichotomizer, which provides the following 
classification of six pattems: 

x = [0.8 0.5 0]', [0.9 0.7 0.3]', [1 0.8 0.5]': class 1 

x = [0 0.2 0.3]', [0.2 0.1 1.3]', [0.2 0.7 0.8]': class 2 

Perform the training task starting from initial weight vector w = 0 and 
obtain the solution weight vector. 

P3.10* Repeat Problem P3.9 for the user-selectable coefficient A. Solve the train- 
ing task for A = 1, then for A = 2 as in (3.31). Note that the initial weight 
vector w ^ 0 should be assumed. 


P3.ll Derive the continuous perceptron training rule (delta training rule) (3.40) 
and obtain a formula for w* +1 similar to (3.53). Assume the unipolar 
activation function of the neuron 


finet ) = 


1 

1 + exp (-net) 


Note that f'(net) should be expressed as function of the neuron’s output. 

P3. 12* Implement the delta training rule as in (3.53) for a single continuous 
perceptron suitable for the classification task specified in Problem P3.9. 
Train the network using data from Problem P3.9. Try several different 
leaming constant values and note the speed of weight convergence in 
each leaming simulation. 


P3.13 The error function to be minimized has the contour map shown in Figure 
P3.13 and is given by 



— Wj) 2 + (1 — w x ) 2 


(a) Find analytically the gradient vector 


V£(w) = 


BE 

dw x 

dE 


dw 2 


(b) Find analytically the weight vector w* that minimizes the error func- 
tion such that V£(w) = 0. 
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E( w 1 , w 2) 

-5.00 -4.00 -3.00 -2.00 -1.00 ø.øø 1.00 2.00 3.00 4.00 5.00 

5.00 

4.00 

3.00 

2.00 
1 .00 

^ 0.00 

l.øø 

2.00 

■3.00 

4.øø 

■5.00 
-5 

Figure P3.13 Contour map of the error function for Problem P3.13. 


(c) Based on part (a) draw the computed gradient vector at three selected 
weight values. Superimpose the three vectors on Figure P3.13 and 
check for the consistency of the gradient vectors with the contours’ 
direction and steepness. 

P3J4 The multicategory trainable classifier using the maximum selector requires 
weight adjustment as expressed in (3.54). Implement the sequence of train- 
ing steps that is needed to classify correctly the three classes as in Example 
3.2 and Figure 3.8. Present the pattems in the sequence P x , P 2 , P 3 , and 
start from initial weights w — 0. Assume c = 1. The augmented pattem 
component equals — 1 . 

P3. 75* Implement the training procedure (3.54) for a four-class classifier having 
the prototype points as in Problem P3.1. Use c = 1 and obtain weights 
w l9 w 2 , w 3 , and w 4 , connecting each of the four inputs to the maximum 
selector as shown in Figure 3.7 for R = 4. 

Then convert the classifier obtained using the maximum selector with 
four inputs to the form as in Figure 3.21 so that it uses four discrete 
perceptrons. Determine the additional threshold values 7\, . . . , T 4 if the 
augmented pattem component is y 4 = —1. 
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P3.16 After the completion of training of the linear machine, its weights have 
been obtained as in Figure P3.16. The pattems that have been submit- 
ted during the supervised training cycle are eight vertices of the three- 
dimensional cube spanned across 0 < ^ < 1. The augmented pattem 
component is assumed to be y 4 = — 1 on the figure. Find the class mem- 
bership of each of the cube vertices. 

P3.17 Convert the classifier shown in Figure P3.16 to the form found in Figure 
3.21 and determine the additional threshold values T x , ..., T 4 necessary 
for correct classification. 

P3.18 The initial weight vectors for a three-class classifier with discrete bipolar 
perceptrons performing the classification as in Example 3.5 are selected 
as w| = = W3 = 0. For pattems presented in the sequence P x , P 2 , P 3 , 

Pi, P 2 > • • • , and trained as in (3.58) find 
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(a) final weights for c = 1. Assume net = 0 as an incorrect response, 
th}us requiring weight adjustment. 

(b) arty “indecision” regions of the trained classifier that may be of sig- 
nificance on the x x , x 2 plane in case of recognition of points that are 
npt identical to the prototypes. 

P3.19 Training the weight adjustments as in (3.58) (RDPTA) of a three-class 
classifier using a three discrete bipolar perceptron network has been com- 
pleted in three steps with the augmented pattem component of —1 and 
c = 1. 

Step 1 : presentation of y { resulted in adjustment of all weights initial- 

ized as 

w | = w i = w i = o. 

Step 2: presentation of y 2 resulted in adjustment of weight W3 only. 
Step 3: presentation of y 3 resulted in adjustment of only. 

The final weights are 


, w* = 


Find pattems x l5 x 2 , and x 3 used for the training. 

P320 Assume that the training rule given in (3.38) has been modified to the 
form 



= w* + c k e k 



where] c k is a constant, 0 < c k < 1, and the error e k is defined as 


e k = d k - y/ kt y k 

and is equal to the difference between the current desired response d k and 
the current value of the neuron’ s activation net k [see (2.43) for the leaming 
rule]. Assume that the (k + l)’th step requires weight adaptation. Pro ve 
that after the weight correction is made in the k'th step, and again the 
same input pattem y* +1 = y k is presented, the error is reduced (1 — c k ) 
times. [This is the property of Widrow-Hoff leaming rule (Widrow 1962).] 

P3.27*Implement the RDPTA training algorithm for the four-category classifier 
using four bipolar discrete perceptrons and using eight prototype points 
in three-dimensional input pattem space. Use c = 1 and compute weights 
w lf W 2 , w 3 , and w 4 connecting each of the four inputs to the perceptrons 
1, 2, 3, and 4, respectively, as a result of the training. Carry out the 
training using the prototype points as specified in Problem P3.1. Sketch 
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the decision planes produeed as a result of the training and verify that the 
classification is in agreement with specifications. 

P3 .22* Design and train the classifier of printed characters L and l shown in the 
bit mdp form in Figure P3.22. Assign input vector entries 0 and 1 to white 
and black pixels, respectively. Use a single discrete bipolar perceptron with 
ten input weights, including the threshold. 

P3.23 The three-class classifier from Example 3.5 produces the decision regions 
as illustrated in Figure 3.22. In addition to three correct classification re- 
gions for points used during training, it also produces four “indecision” 
regions, shaded on the figure. Compute the classifier’s response in each of 
the four “indecision” regions by inspecting the decision surfaces shown 
in the figure. 
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MULTILAYER 

FEEDFORWARD NETWORKS 

A certain lady claims that , after 
tasting a cup of tea with mille , she 
can say which is first poured in the 
cup — milk or tea. This lady States that 
even if she sometimes makes mistakes , 
she is more often right than wrong. 

R. FISHER 

4.1 Linearly Nonseparable Pattern Classification 

4.2 Delta Learning Rule for Multiperceptron Layer 

4.3 Generalized Delta Learning Rule 

4.4 Feedforward Recall and Error Back-propagation Training 

4.5 Learning Factors 

4.6 Classifying and Expert Layered Networks 

4.7 Functional Link Networks 

4.8 Concluding Remarks 

O ur study thus far has focused on using neural networks for classification 
with input pattems that are linearly separable. Networks discussed in the 
previous chapter have been able to acquire experiential knowledge during the 
supervised training process. Both two- and multicategory classification leam- 
ing procedures have been formulated for single-layer networks. The experiential 
knowledge acquisition has been based on the convergent training of single-layer 
discrete perceptron networks, which can adjust their weights incrementally in 
order to achieve correct classification of linearly separable sets of pattems. 

To this point, we have discussed networks that use a linear combination 
of inputs with weights being proportionality coefficients. Such networks work 
with the argument of the nonlinear element simply computed as a scalar product 
of the weight and input vectors. We may call networks of this type linear keeping 
in mind, however, that the nonlinear operation finet) is still performed by the 
decision element on the argument net at every computing node. 
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For training pattems that are linearly separable, the linear neural network 
introduced in previous chapters must be modified if it is to perform the correct 
classificatiori. The modification could involve either a departure from the concept 
of the linear discriminant function or a major change in network architecture. 
Typical nonlinear discriminant functions are chosen to be quadratic or piecewise 
linear. The piecewise linear discriminant functions can be implemented by a 
single-layer linear network employing perceptrons (Nilsson 1965). Our chosen 
architecture, however, will be that of the multilayer network. Each layer of a 
multilayer network is composed of a linear network, i.e., it is based on the 
original conCept of the linear discriminant function. As discussed in Section 2.2, 
multilayer networks are of the feedforward type and are functionally similar to 
the networks covered in Chapter 3. 

Although multilayer leaming machines have been known of for more than 
a quarter of a century, the lack of appropriate training algorithms has prevented 
their successful applications for practical tasks. Recent developments in the super- 
vised training algorithms of multilayer networks with each layer employing the 
linear discripiinant function have led to their widespread and successful present 
use. Multilayer networks are often called layered networks . They can imple- 
ment arbitrary complex input /output mappings or decision surf aces separating 
pattem classes. 

The most important attribute of a multilayer feedforward network is that it 
can leam a mapping of any complexity. The network leaming is based on re- 
peated presetitations of the training samples, as has been the case for single-layer 
networks. The trained network often produces surprising results and generaliza- 
tions in applications where explicit derivation of mappings and discovery of 
relationships is almost impossible. 

In addition to the classification tasks studied in Chapters 3 and 4 to gain 
insight into the single- and multilayer neural networks, many other tasks can be 
performed py such networks. Examples include function approximation, hand- 
written character recognition, speech recognition, dynamical plant Control, robot 
kinematics ånd trajectory generation, expert systems, and many other applica- 
tions. We pøstpone the discussion of applications until Chapter 8 and focus on 
principles and training methods for layered networks. 

Multilayer feedforward network theory and applications have been domi- 
nating in the neural network literature for several years now. Researchers have 
looked for the properties of layered networks and their training methods. Indus- 
trial entrepreneurs found a wealth of successful applications, others developed 
accelerator boards to speed up network simulations on personal computers, and 
graduate students began to write and seil inexpensive training software. Layered 
networks seem to be the most widespread neural network architecture at present. 

To understand the nature of mapping performed by multilayer feedforward 
networks and the training of such networks, we must return to the fundamentals. 
In this chapter we will study trainable layered neural networks employing the 
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input pattem mapping principles. In the case of layered network training, we 
will see that the mapping error can be propagated into hidden layers so that the 
output error information passes backward. This mechanism of backward error 
transmission is used to modify the synaptic weights of intemal and input layers. 
The delta leaming rule and its generalization are used throughout the chapter for 
supervised training of multilayer continuous perceptron networks. 


4.1 — — 

LINEARLY NONSEPARABLE 
PATTERN CLASSIFICATION 

Expressions (3.32) formulated the condition for linear separability of pat- 
terns. This condition can now be briefly restated for the case of linearly nonsep- 
arable dichotomization. Assume the two training sets ^ and °fJ 2 °f augmented 
pattems are available for training. If no weight vector w exists such that 

y'w > 0 for each y e and 

y'w < 0 for each y e <S/ 2 

then the pattem sets < 3/ 1 and are linearly nonseparable. 

Let us now see how the original pattern space can be mapped into the 
so-called image space so that a two-layer network can eventually classify the 
pattems that are linearly nonseparable in the original pattem space. 

Assume initially that the two sets of pattems and d£ 2 should be classified 
into two categories. The example pattems are shown in Figure 4.1(a). Three ar- 
bitrary selected partitioning surfaces 1, 2, and 3 have been shown in the pattem 
space *. The partitioning has been done in such a way that the pattem space now 
has compartments containing only pattems of a single category. Moreover, the 
partitioning surfaces are hyperplanes in pattem space E n . The partitioning shown 
in Figure 4.1(a) is also nonredundant, i.e., implemented with minimum number 
of lines. It corresponds to mapping the ^-dimensional original pattem space x 
into the three-dimensional image space o . 

Recognizing that each of the decision hyperplanes 1 , 2, or 3 is implemented 
by a single discrete perceptron with suitable weights, the transformation of the 
pattem space to the image space can be performed by the network as in Figure 
4.1(b). As can be seen from the figure, only the first layer of discrete perceptrons 
responding with o l9 o 2 , and o 3 is involved in the discussed space transformation. 
Let us look at some of the interesting details of the proposed transformation. 

The discussion below shows how a set of pattems originally linearly non- 
separable in the pattem space can be mapped into the image space where it 
becomes linearly separable. Realizing that the arrows point toward the positive 
side of the decision hyperplane in the pattem space, each of the seven com- 
partments from Figure 4.1(a) is mapped into one of the vertices of the [—1,1] 
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Pattem-to-image I Single layer 
transformation I classifier 


Linearly 

separable 

image 

(b) 

Figure 4.1 a,b Classification of linearly nonseparable patterns: (a) partitioning in the pattern 
space, (b) layéred network implementing the classification from part (a). 


cube. The result of the mapping for the pattems from the figure is depicted in 
Figure 4.1(d) showing the cube in image space o i9 o 2 , and o 3 with corresponding 
compartment labels at comers. 

The pattems of class 1 from the original compartments B, C, and E are 
mapped intb vertices (1, — 1, 1), (-1,1,1), and (1,1,— 1), respectively. In tum, 
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Figure 4.1c Classification of linearly nonseparable patterns (continued): (c) classification in the 
image space by the output perceptron from part (b). 


pattems of class 2 from compartments A and D are mapped into vertices 

(— 1, — 1, 1) and (—1, 1, —1), respectively. This shows that in the image space o , 

the pattems of class 1 and 2 are easily separable by a plane arbitrarily selected, 
such as the one shown in Figure 4.1(c) having the equation o 1 + o 2 + o 3 = 0. 
The single discrete perceptron in the output layer with the inputs o lf o 2 , and o 3 , 
zero bias, and the output o 4 is now able to provide the correct final mapping of 
pattems into classes as follows: 

_ f sgn (o Y + o 2 + o 3 ) > 0 : class 1 

° 4 \ sgn (o l + o 2 + o 3 ) < 0 : class 2 

From the discussion above, we see that the procedure of training, in the 
case of the layered network, has to produce linearly separable images in the 
image space. If the pattem parameters and desired responses were completely 
known, the weight calculation of the two-layer linear machine could be ac- 
complished, as just described, without an experiential knowledge acquisition. In 
another approach, a single-layer-only machine with the piecewise linear discrim- 
inant functions can be employed. Such a classifier with a nonlinear discriminant 
function would be able to produce appropriate partitioning of the input space and 
to classify pattems that are linearly nonseparable. 

Although both these approaches are feasible, they possess a rather significant 
disadvantage. Since we favor acquiring experiential knowledge by the layered 
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leaming network over the calculation of weights from pattem parameters, we 
should adopt and use its stepwise supervised leaming algorithms. We will also 
favor employing linear discriminant functions instead of piecewise, quadratic, or 
any other type of nonlinear function. This preference is based on the premise that 
the continuous perceptron with net in the form of a scalar product is trainable and 
it offers both mapping and representation advantages. However, no efficient and 
systematic leaming algorithms were known for layered linear machines employ- 
ing perceptrons until recently (Werbos 1974; McClelland and Rumelhart 1986). 


EXAMPLE 4.1 


In this ^xample we design a simple layered classifier that is able to classify 
four linearly nonseparable pattems. The discussion will emphasize geomet- 
rical considerations in order to visualize the input-to-image-to-output space 
mapping during the leaming cycle. The classifier is required to implement 
the XOR function as defined in (3.21) for two variables. The decision func- 
tion to 3e implemented by the classifier is: 



The pattems, along with the proposed pattem space partitioning implemented 
by the jfrrst layer consisting of two bipolar discrete perceptrons, are shown 
in Figure 4.2(a). The arbitrary selected partitioning shown is provided by 
the twd decision lines having equations 



For the lines shown, the corresponding unit normal vectors pointing toward 
the positive side of each line are equal to 
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(b) 

Figure 4.2 Figure for Example 4.1: (a) pattern to image space transformation and (b) classifier 
diagram. 
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Figure 4.2(a) shows that the three compartments produced in the Xj, x 2 
plane arø mapped through the two first-layer discrete perceptrons with TLU 
TLU #2 shown in Figure 4.2(b). The mapping of pattems into the 


#1 and 


image space is: 


o { = sgn ( -2x x + x 2 - 1 


o 2 = sgn [ x x - x 2 - - 


(4.2a) 


perfoi 


The 
4.3(a) prj 
first 

image sp; 
rather th 
collapse 
with ima: 
decision 
B,C in 


mied mapping is explained in more detail in Figure 4.3. Figure 
pvides the mapping and classification summary table. As before, the 
provides an appropriate mapping of pattems into images in the 
»ace. The second layer implements the classification of the images 
kn of the original pattems. Note that both input pattems A and D 
|in the image space into a single image (-1,-1). The image space 
iges of original pattems is shown in Figure 4.3(b). An arbitrary 
line providing the desired classification and separating A,D and 
e image space has been selected as 


°\ °2 1 = 0 
The TLI| #3 implements the decision o 3 as 

o 3 — sgn 4* o 2 + 1) 


(4.2b) 


(4.2c) 


The complete diagram of the classifier is shown in Figure 4.2(b). The out- 
put perceptron has weights w x - 1, w 2 = 1, and w 3 — -1 to satisfy 
Eq. (4.2cj). Figure 4.4 illustrates the joint mapping between the pattem and 
output space. Discrete perceptron mapping of plane x x , x 2 into o 3 is shown 
in Figure 4.4(a). Figure 4.4(b) presents the contour map o 3 (x l ,x 2 ) for contin- 
uous bipolar perceptrons with A = 6 used rather than discrete perceptrons. 
The weight values for the mapping depicted are identical to those in Figure 
4.2(b), ojily the sgn(-) functions have been replaced with functions defined 
as in (2.3a). 

This example provides insight into the parametric design of a layered 
classifier for linearly nonseparable pattems. The design has used arbitrary 
but meaningful mapping of inputs into their images and then into outputs, 
and decision lines suitable for classification of pattems with known param- 
eters have been produced by inspection. The example, however, has not yet 
involved an experiential training approach needed to design neural network 
classifiers. mm 


Before we derive a formal training algorithm for layered perceptron net- 
works, one additional aspect is worth highlighting. The layered networks develop 



significant, if not powerful, self-organization and mapping properties during train- 
ing. The self-organization and mapping do not have to be engineered in advance. 
Instead, the acquisition of knowledge takes place gradually during the training 
phase by inspection of mapping examples. 

Figure 4.5 (a) illustrates the specific case of classification of more involved 
planar pattems. Both shaded disjointed areas A and B on the figure belong to 
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the category 1, and the rest of the pattem space of the plane belongs to the 
category 2. The classification of the input pattems, or mapping of the input 
space to class^s, can be provided by the network shown in Figure 4.5(b). If the 
weighted sum of the nonaugmented pattems exceeds the threshold value 7, the 
TLU element responds with +1, otherwise with —1. 
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o 3( x 1 , x 2) , la mb d a =6 

- 1 . 50 - 1 . 23 - 0 . 95 - 0 . 67 - 0 . 40 - 0.12 0.15 0.43 0.70 0.98 1.25 



xl 

(b) 


Figure 4.4b Linearly nonseparable classification of pattems in Example 4.1 (continued): 
(b) input to output mapping using continuous perceptrons (A = 6). 


It is easy to verify that the input layer provides the folio wing mapping: units 
#1 and #2 select the vertical strip a < x x < b, and units #3 and #4 select the 
strip c < x x < d. Units #5 with #6 select the horizontal strip e < x 2 < /. The 
output layer unit has the threshold value of 3.5. It responds with +1 signifying 
membership in class 1 when either its weighted input activations from units 1 
and 2 at the summing node are both 1, or its weighted input activations from 
units 3 and 4 are both 1, and, at the same time, its weighted input activations 
from units 5 and 6 are also both 1. The summed excitatory input value to TLU 
#7 is thus of value 4 in the case of inputs in either area A or B, which results 
in a TLU #7 output of +1, thus signifying class 1. 

The layered networks consisting of discrete perceptrons described in this 
section are also called “committee” networks (Nilsson 1965). The term commit- 
tee is used because it takes a number of votes of the first layer to determine 
the input pattem classification. The inputs in the pattem space are first mapped 
into the images consisting of vertices of the cube [-1, 1]. Up to 2 n images, each 
image being a vertex of the cube, are available as binary images of n-tuple input 
vectors. Then, the image space mapping is investigated by the output perceptrons 
as to its separability and as to the mapped vectors class membership. 
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4.2 

DELTA LEARNING RULE FOR 

MULTIPERCEPTRON LAYER 

The following discussion is focused on a training algorithm applied to 
the multilayer feedforward networks. The algorithm is called the error back- 
propagation training algorithm. As mentioned earlier in this chapter, the algo- 
rithm has reawakened the scientific and engineering community to the modeling 
of many quantitative phenomena using neural networks. 

The back-propagation training algorithm allows experiential acquisition of 
input /output mapping knowledge within multilayer networks. Similarly, as in 
simple cases of the delta leaming rule training studied before, input pattems are 
submitted during the back-propagation training sequentially. If a pattem is submit- 
ted and its classification or association is determined to be erroneous, the synaptic 
weights as well as the thresholds are adjusted so that the current least mean- 
square classification error is reduced. The input /output mapping, comparison of 
target and actual values, and adjustment, if needed, continue until all mapping 
examples from the training set are leamed within an acceptable overall error. 
Usually, mapping error is cumulative and computed over the full training set. 

During the association or classification phase, the trained neural network it- 
self operates in a feedforward manner. However, the weight adjustments enforced 
by the leaming mies propagate exactly backward from the output layer through 
the so-called “hidden layers” toward the input layer. To formulate the leaming 
algorithm, the simple continuous perceptron network involving K neurons will 
be revisited first. Let us take another look at the network shown in Figure 3.23. 
It is redrawn again in Figure 4.6 with a slightly different connection form and 
notation, but both networks are identical. 



Figure 4.6 Single-layer network with continuous perceptrons. 
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input 

thii: 


The 

tively. We 
signal values 
As before, 
to the A ’th ne 
Using 
can be exprei 


thp 

neu 

the 


and output values of the network are denoted y ; and o k , respec- 
s denote y y -, for j = 1,2,...,/, and o k , for k = 1, 2, . . . , K, as 
jat the / th column of nodes, and £’th column of nodes, respectively. 

weight w k j connects the output of the /th neuron with the input 
|uron. 

vector notation, the forward pass in the network from Figure 4.6 
dsed as follows 


o = T[Wy] (4.3a) 

where the inpjut and output vector and the weight matrix are, respectively 



>r 



y = 

yi 

, 0 = 

o 2 


.yj. 


.o K . 


W = 


w n w 12 
w 2 \ W 2 2 


% 

™2J 

w Kj\ 

0 

o 

/(•). 


L M ’/n w k2 

and the nonliijiear diagonal operator F[ • ] is 

/(•) 0 

0 /(•) 

n •] = ■ 

I 

o o 

The desired (t|arget) output vector is 

di 
di 

\~d K J 

Observe that the activation vector net* of the layer k is contained in the brackets 
in relationship 


The erroii expression introduced in (2.37) for a single perceptron is now 


generalized to 


(4.4) refers to 


(4.3a) and it can be expressed as 

net* = Wy 


(4.3b) 


include all squared errors at the outputs k = 1, 2, . . . , K 


1 * 

E P = ~ Z (d k 

1 k= 1 


°pk) 2^ P ° p \ 


1, 


(4.4) 


for a specific pattem p, where p = 1, 2, . . . , P. Let us note that the subscript p in 


|a specific pattem that is at the input and produces the output error. 
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At this point, the delta training rule introduced in (2.36), and later intuitively 
obtained in (3.55), can be formally derived for a multiperceptron layer. Let us as- 
sume that the gradient descent search is performed to reduce the error E p through 
the adjustment of weights. For simplicity, it is assumed that the threshold values 
T k , for k = 1, 2, . . . , K, are adjustable along with the other weights, and no 
distinction is made between the weights and thresholds during leaming. Now the 
thresholds T k are leamed exactly in the same manner as the remaining weights. 
This, of course, implies that 


Wjj = T k , for k = 1, 2, . .* . , K 

and the fixed input is of value 


yj = -i 


during both the training and feedforward recall phases. Requiring the weight 
adjustment as in (2.39) we compute individual weight adjustment as follows: 

dE 

\w kj = -17— (4.5a) 

dw kj 

where the error E is defined in (4.4) with subscript p skipped for brevity. For 
each node in layer k, k = 1, 2, . . . , K, we can write using (4.3b) 

j 

net k = I w kj yj (4.5b) 

7=1 

and further, using (4.3a) the neuron’s output is 


o k = f(net k ) 


The error signal term S called delta produced by the &’th neuron is defined for 
this layer as follows 



dE 

d{net k ) 


(4.6) 


It is obvious that the gradient component dE/dw kj depends only on the net k of 
a single neuron, since the error at the output of the £’th neuron is contributed to 
only by the weights w k j, for j = 1 , 2, . . . , /, for the fixed k value. Thus, using 
the chain rule we may write 


dE = dE d(net k ) 
dw kj d(net k ) dw kJ 

The second term of the product of Eq. (4.7) is the derivative of the sum of 

Products of weights and pattems w kx y x + w k 2 y 2 H + as in (4.5b). Since 

the values \j, for j = 1, 2, ...,/, are constant for a fixed pattem at the input, 
we obtain 

d(net k ) 

dw kj 


(4-8) 
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Combining (4 


6) and (4.8) leads to the following form for (4.7): 

dE 


dw t 


kj 


= -&okyj 


(4.9) 


The weight adjustment formula (4.5a) can now be rewritten using the error signal 
8 ok term as below 

A w kj = rj8 olc yj, for k = 1, 2, . . . , K and j = 1,2 J (4.10) 

Expression (4.10) represents the general formula for delta training / learning 
weight adjustments for a single-layer network. It can be noted that A w k j in 
(4.10) does not depend on the form of an activation function. As mentioned 
in Chapter 2, the delta value needs to be explicitly computed for specifically 
chosen activation functions. It also follows from (4.10) that the adjustment of 
weight w k j is proportional to the input activation y jt and to the error signal value 
4* at the k' thi neuron’s output. 

To adapt the weights, the error signal term delta 4* introduced in (4.6) needs 
to be computed for the /.'th continuous perceptron. Note that E is a composite 
function of net k , therefore it can be expressed for k = 1, 2, . . . , K, as follows: 

E(net k ) = E [o k (net k )] (4.11) 

Thus, we havé from (4.6) 


Denoting the 


and noting thi 


&ok ~ 


dE 


do b 


do k d(net k ) 

^econd term in (4.12) as a derivative of the activation function 

A d°k 


(4.12) 


fk( net k) 


d(net k ) 


(d k ~ o k ) at t 
fk’inetk), whic 


activation value 


dE - _ * 

{dk ° k) 

allows rewriting formula (4.12) as follows 

S ok = (4 - °k)fk'(net k ), for k = 1, 2, .... K 


(4.13a) 


(4.13b) 


(4.14) 


Equation (4.14) shows that the error signal term 8 ok depicts the local error 


he output of the &’th neuron scaled by the multiplicative factor 
h is the slope of the activation function computed at the following 


net k =f \o k ) 


The final formula for the weight adjustment of the single-layer network can 
now be obtainied from (4.10) as 

A w kj = 17(4 _ o k )f k ’(net k )y } 


(4.15a) 
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and it is identical to the delta training rule (2.40). The updated weight values 
become 


w' k j = w k j + A w k j for k = 1, 2, . . . , K, and 7=1,2,...,/ 


(4.15b) 


Formula (4.15) refers to any form of the nonlinear and differentiable activation 
function finet) of the neuron. Let us examine the two commonly used delta 
training rules for the two selected typical activation functions finet ). 

For the unipolar continuous activation function defined in (2.4a), finet) can 
be obtained as 


finet) 


zxp(-net) 

[1 + exp (—net)] 2 


(4.16a) 


This can be rewritten as 
finet) = 


1 

1 + exp (—net) 


1 + exp (-net) - 1 
1 + exp (—net) 


(4.16b) 


or, if we use (2.4a) again, it can be rearranged to a more useful form involving 
output values only 


finet) = o(l — o) (4.16c) 

Let us also observe that the delta value of (4.14) for this choice of the activation 
function becomes 


&ok ~ (djt °k)°k( 1 °k) (4-17) 

The delta value for the bipolar continuous activation function as in (2.3a) can be 
expressed as 

Kk = \id k - o k )( 1 - o 2 k ) (4.18a) 

which uses the folio wing identity for finet) 

finet) = 1(1 - o 2 ) (4.18b) 

Expression (4.18b) was derived in Section 3.6, see formulas (3.48) through (3.51). 

Summarizing the discussion above, the updated individual weights under the 
delta training rule can be expressed for k = 1, 2, . . . , K, and j = 1, 2, ...,/, as 
follows 


w'kj = W kj + y](d k - o k )o k i 1 - o k )yj 


(4.19a) 


for 


and 


o k - 


1 


1 + exp(— net k ) 


w'kj = w kj + -T} (d k - o k )(l - o 2 )yj 


(4.19b) 
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for 


The updated 
shown in 


Figur* 


1 + exp(— net k ) 

jveights under the delta training rule for the single-layer network 
e 4.6 can be succinctly expressed using the vector notation 

W' = W + 778./ (4.20) 


where the errcj>r signal vector 8 0 is defined as a column vector consisting of the 
individual errør signal terms: 


. 1 $ 


Error signal 
(4.17) or (4, 
entries 8 ok are 
It should be 4 
dimensional, 
lixed at -1. 

The trainii 
diversified res 
Chapter 3. Th|* 
associations 
As such, a c< 
versified set 
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Summaryl 

Algorithm 


°k — 2 
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u ol 
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L 8 oK 


ector entries 8 ok are given by (4.14) in the general case, or by 
a) depending on the choice of the activation function. Noticeably, 
local error signals dependent only on o k and d k of the k ' th neuron, 
oted that the nonaugmented pattem vector in input space is J — 1 
Input values for j = 1, 2, — 1, are not limited, while yj is 

ng rule for this single-layer neural network does provide a more 
ponse than the network of K discrete perceptrons discussed in 
le properly trained network also provides a continuous degree of 
at would have been of the binary form if TLUs alone were used. 
continuous perceptron network will be able to provide a more di- 
responses rather than only binary-valued vectors, which are the 
;es at the classifier’s output. These novel aspects of a continuous 
t|work will be discussed later along with introductory examples of 


algorithm for multicategory perceptron training is given below. 


of the Multicategory Continuous Perceptron Training 
(MCPTA) 


Givøn are P training pairs arranged in the training set 
{y 1 ,d 1 ,y 2 ,d 2 ,...,y P ,d F } 

where y ( - is (J X 1), d ; is ( K X 1), and i = 1, 2, . . . , P. Note that the 
./’th component of each y, has the value —1 since input vectors have 
been augmented. Integer q denotes the training step and p denotes the 
counter within the training cycle. 

Step 1 : 17 > 0, £ max > 0 chosen. 
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Step 2: Weights W are initialized at small random values; W is 
(KXJ): 

q *— 1, p <— 1, E <— 0. 

Step 3: Training step starts here. Input is presented and output com- 
puted: 

y<-y p( d<-d p 

°k <-/(w*y)> for k = 1, 2, . . . , K 
where w* is the £’th row of W and f(net) is as defined in (2.3a). 

Step 4: Weights are updated: 

1 , 

w* <- w* + -7]{d k - o k )( 1 - o k )y, for k = 1, 2, .... K 

where w* is the fc’th row of W (See Note at end of list.) 

Step 5: Cumulative cycle error is computed by adding the present 
error to E: 

E «- - o k f + E, for k = 1, 2, . . . , K. 

Step 6 : Ifp < P , then p<— p+1, <jr<— ø+l, and go to Step 3; 
otherwise, go to Step 7. 

Step 7: The training cycle is completed. 

For E < £ max terminate the training session. 

Output weights W, q y and E. If E > £ max , then E <— 0, 
p 4 — 1 , and initiate a new training cycle by going to Step 3. 

■■ NOTE: If formula (2.4a) is used in Step 3, then the weights 

are updated in Step 4 as follows: 

w* W£ + i} (d k — o k )o k ( 1 — o k ) y, for k = 1 , 2 , . . . , Å' 


4.3 

GENERALIZED DELTA 

LEARNING RULE 

We now focus on generalizing the delta training rule for feedforward layered 
neural networks. The architecture of the two-layer network considered below is 
shown in Figure 4.7. It has, strictly speaking, two layers of processing neurons. 
If, however, the layers of nodes are counted, then the network can also be labeled 
as a three-layer network. The fth column of signals would be understood in such 
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Figure 4.7 Laye red feedforward neural network with two continuous perceptron layers. 

a case to be an output layer response of a nonexistent (input) layer of neurons. 
We can thus refer to the architecture in Figure 4.7 as a two-neuron layer network, 
or three-node layer network. 

There is rio agreement in technical literature as to which approach is to be 
used to describe network architectures. In this text we will use the term “layer” 
in reference to the actual number of existing and processing neuron layers. There- 
fore, we will not count input terminals as layers. This convention is becoming 
more frequently adopted and seems more logical since the input nodes play no 
significant role in processing. Thus, the network of Figure 4.7 is a two-layer 
network. Let us also note that an AMayer network has N — l layers of neurons 
whose outputs are not accessible. 

Layers w th neurons whose outputs are not directly accessible are called 
internal or hidden layers . Thus, all but the output layer are hidden layers. Since 
the output of layer j of the neurons is not accessible from input and output, the 
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network from Figure 4.7 can be called a single hidden-layer network. In subse- 
quent considerations, we will derive a general expression for the weight increment 
A Vji for any layer of neurons that is not an output layer. The reader may notice 
the change of subscripts. The computed weights lead now from node i toward 
node j as shown in the figure. 

The negative gradient descent formula (4.5a) for the hidden layer now reads 


a dE 

Av » = ~ v d^- 


and formula (4.7) becomes 


for j = 1, 2, . 

. . , J and i = 1, 2, . . . , / 

(4.21a) 

dE _ dE 

d(netj) 

(4.21b) 

dvji d(netj) 

ty, 


Let us notice that the inputs to the layer are z i9 for i — 1, 2 Based on 
relation (4.8), the second term in the product (4.21b) is equal to z f , and we may 
express the weight adjustment similarly to (4.10) as 


Av j7 = t idyjZi (4.21c) 

where S y j is the error signal term of the hidden layer having output y. This error 
signal term is produced by the /th neuron of the hidden layer, where j - 1, 2, 
...,/. The error signal term is equal to 

8yj = ~ d(^i~y fory = u 2> ' (4-22) 

In contrast to the output layer neurons’ excitation net k , which affected the k’th 
neuron output only, the netj contributes now to every error component in the 
error sum containing K terms specified in expression (4.4). The error signal term 
Syj at the node j can be computed as follows: 


8yj ty 


dE ty 


d(netj) 


where 


%rU^i [dk - n " e,km *) 


and, obviously, the second term of (4.23a) is equal to 
Routine calculations of (4.23b) result in 

Calculation of the derivative in braces of expression (4.24a) yields 


(4.23a) 

(4.23b) 

(4.23c) 

(4.24a) 


dE 

ty 


K 

S (4 

t=i 


o k )f(net k ) 


d(net k ) 

ty 


(4.24b) 
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We can simp] 
expression (4 


Combining (4(. 
the form 


.23c) and (4.24c) results in rearranging S yj expressed in (4.23a) to 


Sy, = fj'(netj) £ 8 ok w kj , for j = 1, 2 J 

k= i 

The weight adjustment (4.21c) in the hidden layer now becomes 


Wji = t ifj'(netj)Zi S for j = 1, 2, ... ,7 and 

* =1 / = 1 , 2 ,...,/ 


where the jf.' 
as in case of 
generalized d^h 
the hidden 
following lay^i 
fan out from 
S y j of the /th 
4.7. All outpi 
justment of hi 
modified wei 


(net:) 


The hidden 
for the netwoi 


where 
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ify the above expression to the compact form as below by using 
14) for S ok and (4.5b) for net k . 


dE £ o 

dyj~ *?, 8okWkj 


(4.24c) 


(4.25) 


(4.26a) 


]f terms are to be computed either from (4.16d) or from (4.18b) 
[simple delta rule training. Formula (4.26a) expresses the so-called 
ta learning rule . The adjustment of weights leading to neuron j in 
r is proportional to the weighted sum of all 5 values at the adjacent 
r of nodes connecting neuron j with the output. The weights that 
node j are themselves the weighting factors. The weights affecting 
hidden neuron have been highlighted in the output layer in Figure 
layer errors 8 ok w k j , for k = 1, 2, K, contribute to the ad- 
jghlighted weights v Jh for / = 1, 2, . . . , /, of the hidden layer. The 
of the hidden layer can be expressed now as 


ikt 


ights 


[= vji + vfj\netj)z i t 8 ok w kj , for j = 1, 2, . . . , J and (4.26b) 
* =1 i = 1,2,...,/ 


lsjyi 


er weight adjustment based on the generalized delta training rule 
rk in Figure 4.7 can be succinctly stated in vector notation as 


V' = V + tj8 v z' 


^2 

Lz/J 


(4.27) 



Vil 

Vl2 ' 

■■ V U 

V = 

V 2 i 

v 22 ' 

' • V 2 J 


- v j\ 

VJ2 ■ 

■■ V J1- 


and 8 y is the column vector with entries <5 v; given by (4.25). 
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Defining now the /’th column of matrix W as w y, vector 8 V can be expressed 
compactly as follows 

8, = w'8 0 f; (4.28) 

where f' y is the column vector with entries f' y] expressed for each hidden layer 
neuron 1,2, for unipolar and bipolar activation functions, respectively, as 

f'yj = yj( 1 - y } ) (4.29a) 

= (4.29b) 

Vector 8 0 used in (4.28) is defined as in (4.20). 

Comparison of the delta training rule (4.20) for adjusting the output layer 
weights and the generalized delta training rule (4.27) for adjusting the hidden 
layer weights indicate that both formulas are fairly uniform. The significant dif- 
ference is in subscripts referring to the location of weights and input signals, and 
in the way the error signal vector 8 is computed. The vector 8 0 contains scalar 
entries (4.17) or (4.18a). Each component of vector 8 0 is simply the difference 
between the desired and actual output values times the derivative of the activation 
function. The vector 8 >>( however, contains entries that are scalar products wj8 0 /( y - 
expressing the weighted sum of contributing error signals 8 0 produced by the 
following layer. The generalized delta leaming rule propagates the error back by 
one layer, allowing the same process to be repeated for every layer preceding the 
discussed layer j. In the following section we will formalize the training method 
for layered neural networks. 


4.4 — 

FEEDFORWARD RECALL 
AND ERROR BACK- 
PROPAGATION TRAINING 

As discussed in the previous section, the network shown in Figure 4.7 needs 
to be trained in a supervised mode. The training pattem vectors z should be 
arranged in pairs with desired response vectors d provided by the teacher. Let us 
look at a network feedforward operation, or recall. As a result of this operation 
the network computes the output vector o. 


Feedforward Recall 

In general, the layered network is mapping the input vector z into the output 
vector o as follows 


o = N[z] 


(4.30) 
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where N derjotes a composite nonlinear matrix operator. For the two-layer net- 
work shown, the mapping z —> o as in (4.30) can be represented as a mapping 
within a mapping, or 

o = r[wr[Vz]], (4.31a) 

where the inlemal mapping is 

HVz] - y (4.31b) 

and it relater to the hidden layer mapping z — > y. Note that the right arrows 
denote mapping of one Space into another. Each of the mappings is performed 
by a single-layer of the layered network. The operator Y is a nonlinear diagonal 
operator with diagonal elements being identical activation functions defined as in 
(4.3a). The diagonal elements of T operate on net values produced at inputs of 
each neuron. It follows from (4.31) that th e/(-) arguments here are elements of 
vectors net ; and net Å for the hidden and output layers, respectively. 

As can be seen from (4.31b), the assumption of identical and fixed activa- 
tion functions f(net) leads to the conclusion that the only parameters for mapping 
z-4 0 so that o matches d are weights. Specifically, we have two matrices V and 
W to be adjusted so that the error value proportional to ||d — o|| 2 is minimized. 
Thus, we can look at layered neural networks as versatile nonlinear mapping 
systems with weights serving as parameters (Narendra 1990). 

Error Back-propagation Training 

Figure 4 ,8(a) illustrates the flowchart of the error back-propagation training 
algorithm for a basic two-layer network as in Figure 4.7. The leaming begins with 
the feedforward recall phase (Step 2). After a single pattem vector z is submitted 
at the input, the layers’ responses y and o are computed in this phase. Then, the 
error signal computation phase (Step 4) follows. Note that the error signal vector 
must be deterpined in the output layer first, and then it is propagated toward the 
network input nodes. The K X J weights are subsequently adjusted within the 
matrix W in Step 5. Finally, J X / weights are adjusted within the matrix V in 
Step 6. 

Note thai the cumulative cycle error of input to output mapping is computed 
in Step 3 as a sum over all continuous output errors in the entire training set. The 
final error value for the entire training cycle is calculated after each completed 
pass through the training set {z 1? z 2 , . . . ,z F }. The leaming procedure stops when 
the final error value below the upper bound, £ max , is obtained as shown in Step 8. 

Figure 4.8(b) depicts the block diagram of the error back-propagation trained 
network operation and explains both the flow of signal, and the flow of error 
within the network. The feedforward phase is self-explanatory. The shaded por- 
tion of the diagram refers to the feedforward recall. The blank portion of the 
diagram refers to the training mode of the network. The back-propagation of 
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*If f(net) given by (2.4a) is used in Step 2, then in Step 4 use 
s o =[( d k~o k )(i-o k )o k l 
(a) 


Figure 4.8a Error back-propagation training (EBPT algorithm): (a) algorithm flowchart. 
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y Feed forward phase // 

Zz^y/////////?//////. 

Back-propagation phase 


Figure 4.8b Error back-propagation training (EBPT algorithm) (continued): (b) block diagram 
illustrating forward and backward signal flow. 

error d - o From each output, for k = 1, 2, . . . , K, using the negative gradient 
descent technique is divided into functional steps such as calculation of the error 
signal vector 8 C and calculation of the weight matrix adjustment AW of the 
output layer. The diagram also illustrates the calculation of intemal error signal 
vector Sy and of the resulting weight adjustment AV of the input layer. 

The algorithm of error back-propagation training is given below. 


Summary of the Error Back-Propagation Training Algorithm (EBPJA) 

Given are P training pairs 

{zi,d],Z2>d 2 , • • • ,Zp,dp}, 

where is (/ X 1), d, is (K X 1), and i = 1, 2, ..., P. Note that 
the /’th component of each z i is of value — 1 since input vectors have 
been augmented. Size / — 1 of the hidden layer having outputs y is 
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selected. Note that the /’ th component of y is of value — 1, since hidden 
layer outputs have also been augmented; y is (/ X 1) and o is (K X 1). 

Step 1 : 17 > 0, £ max chosen. 

Weights W and V are initialized at small random values; W 
is (K X /), V is (/ X /). 

q<-l 9 p<-l 9 £ 0 

Step 2: Training step starts here (See Note 1 at end of list.) 

Input is presented and the layers’ outputs computed [f(net) 
as in (2.3a) is used]: 

z <- z p , d v- d p 

yj <-/(v'z), for; =1,2 J 

where Vy, a column vector, is the /th row of V, and 
°k <-/(' w*y), for £ — 1, 2, K 
where w*, a column vector, is the £’th row of W. 

Step 3: Error value is computed: 

1 o 

E <— ~(d k - o k ) 2 + E, for k = 1, 2, . . . , K 

Step 4: Error signal vectors 8^ and 8 y of both layers are computed. 
Vector is (K X 1), 8^ is (/X 1). (See Note 2 at end of list.) 
The error signal terms of the output layer in this step are 

S ok = ~ °k)( l - °l\ for k = 1, 2, . . . , K 

The error signal terms of the hidden layer in this step are 

S yj = 1(1 - yj) 1 8 ok w kj , for j = 1, 2 J 

2 J *=1 

Step 5: Output layer weights are adjusted: 

w k j <— w k j + T]8 ok yj, for k = 1,2, ... ,K and 

j — 1,2, ... ,J 

Step 6: Hidden layer weights are adjusted: 

Vji <— v ji + rjSyjZi, for / = 1, 2, ... ,J and 

i= 1,2 ,... ,1 

Step 7: If p < P then p<— p + 1, q *— <7+1, and go to Step 2; 
otherwise, go to Step 8. 
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Step 8: The training cycle is completed. 

For E < £ max terminate the training session. Output weights 
W,V, < 7 , and£. 

If E > £ max , then E 0, p <— 1, and initiate the new training 
cycle by going to Step 2. 

h NQTE 1 For best results, pattems should be chosen at random 
from the training set (justification follows in Section 4.5). 

■■ NOTE 2 If formula (2.4a) is used in Step 2 , then the error 
signal terms in Step 4 are computed as follows 

S ok = C d k - - °k)°b for k = 1, 2, . . . , K 

K 

8 yj = yÅ l " yy> I 8 ok w kp for j = 1 , 2, . . . , J 


Several aspects of the error back-propagation training method are noteworthy. 
The incremental leaming of the weight matrix in the output and hidden layers is 
obtained by i:he outer product rule as 

AW = r? 8 y' 

where 8 is the error signal vector of a layer and y is the input signal vector to that 
layer. Noticeably, the error signal components S 0 k at the output layer are obtained 
as simple scalar products of the output error component d k — o k and f'(net k ). In 
contrast to thjis mode of error computation, hidden-layer error signal components 
S y j are computed with the weight matrix W seen in the feedforward mode, but 
now using it$ columns w ; . As we realize, the feedforward mode involves rows 
of matrix W for the computation of the following layer ’s response. 

Another observation can be made regarding linear versus nonlinear opera- 
tion of the network during training and recall phases. Although the network is 
nonlinear in the feedforward mode, the error back-propagation is computed using 
the linearized activation function. The linearization is achieved by extracting the 
slope f’(net) at each neuron ’s operating point and using it for back-transmitted 
error signal scaling. 

] 

Examplé of Error Back- 
Propagation Training 

The following example demonstrates the main features of the error back- 
propagation fraining algorithm applied to a simple two-layer network. 
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EXAMPLE 4.2 


In this example we will perform a single training step for the three-neuron, 
two-layer network shown in Figure 4.9(a). Let us note that the original 
network has been augmented in the figure with dummy neurons 1 and 2 
to conform to the general network structure as illustrated in Figure 4.7. 
Obviously, the Virtual inputs to the network have to be assigned values 
o x and o 2 , which are the responses of the dummy neurons. A single two- 
component training pattem is assumed to be present at inputs o x and o 2 . 
In addition, dummy neuron 0 with a fixed output of - 1 has been added to 
generate the weights w 30 , w 40 , and w 50 corresponding to the thresholds T 3 , 
T 4 , and r 5 , respectively, for actual neurons 3, 4, and 5. Its only function is 
to augment the input vectors to each neuron by a fixed bias component. 

For the sake of clarity and to avoid identical indices for different 
weights in this example, the hidden layer neurons in column j have been 
numbered 3 and 4, and the output neuron has been numbered 5. The leaming 
starts after all weights have been randomly initialized. Inputs o x and o 2 are 
presented and o 3 , o 4 , and o 5 are computed with o 0 = — 1. This feedforward 
recall phase more than likely results in a nonzero error E defined as in (4.4). 



i=l,2 

Dummy 

neurons 

(inputs) 


7=3,4 


ø,=(l+<r”V 

1=3, 4,5 


(a) 


Figure 4.9a Figure for Example 4.2: (a) network diagram. 
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The error signal term value 8 5 can now be computed using (4.14), or for 
the seleeted activation function using formula (4.17) as follows: 

85 = W5 ~~ 05)05(1 — 05) 

The generalized delta values S 3 and S 4 as in (4.25) need to be expressed 
for hidcbn layer neurons 3 and 4: 


Using (- 


83 =f sinet-s ) I S k w k3 

k = 5 
5 

84 = f\(net A ) X S*w* 4 

k=5 

.29a) yields the final expressions for <S 3 and 8 4 as 

83 = ° 3(1 ° 3 )^ 5 VV 53 

S 4 = o 4 (l — o 4 )S 5 w S4 


(4.32a) 


(4.32b) 


Now the weight vector corrections Aw kj - and A Wjj can be computed for the 
selected rj value. Output layer weight adjustments computed from (4.10) are 

Aw 50 = ~v8 s 

Aw 53 = r)S 5 o 3 (4.33a) 

Aw 54 = T ]S 5 o 4 

The adjustments of hidden-layer weights associated with the neurons 3 and 
4 computed from (4.21c) or (4.26a) are, respectively, 

(4.33b) 


and 


Aw 30 .= 

-t?S 3 

Aw 31 = 

17630 , 

Aw 32 — 

7]8 3 0 2 

il 

g 

£ 

<1 

-1784 

Aw 41 = 

17840 , 

Aw 42 = 

778402 


(4.33c) 


Superposition of the computed weight adjustments (4.33) and of the present 
weights w kj and results in the updated weight values. This terminates the 
single stipervised leaming step based on the single pattem vector from the 
training set. Subsequently, the next pattem vector is submitted and the train- 
ing step repeated. The training steps proceed until all pattems in the training 
set are øxhausted. This terminates the complete training cycle understood 
as the siveep through the sequence of P training steps, where P denotes the 
number I of training pattems in the training set. 

Thø cumulative cycle error is computed for the complete training cycle 
using expression (4.4). It is then compared with the maximum error allowed. 
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Class 2 c . 

5 * 



Input to image mapping 


Image to output mapping 


Figure 4.9b, c Figure for Example 4.2 (continued): (b) space transformations, run 1, and 
(c) space transformations, run 2. 



194 


MULTILAYER FEEDFORWARD NETWORKS 


The training cycles repeat until the cycle error drops below the specified 
maximi im error value. If the network has failed to leam the training set 
successfully, the training should be restarted with different initial weights. 
If this also fails, other remedies need to be taken to meet the training 
requirements. Such steps are discussed in Section 4.5 in more detail. 

The network covered in this example can also be used to demonstrate 
the classification of linearly nonseparable pattems. Due to its size, it can be 
used directly to solve the XOR problem for two variables. Network training 
has been simulated and two sample runs with random initial weight values 
are summarized below for 77 = 0.1. Matrices W and V are defined for this 
network as 

W=[w 50 W53 w 54 ] 
y A W30 W 3l W 32 

[W40 W 41 W 42 . 

The resulting weight matrices obtained for run 1 (1244 steps) with random 
initial Weights are: 

Wj = [-3.328 6.898 -6.584] 


3.116 5.545 -6.094" 

-0.739 5.467 -3.178 


The resulting weight matrices obtained for run 2 (2128 steps) with another 
random set of initial weights are: 

W 2 =[ -3.967 -8.160 -5.376] 

= [ 6.169 3.854 4.28 1' 

V2 [-1.269 -4.674 -4.578 _ 


Since we require that the network from Figure 4.9(a) to function as a clas- 
sifier w ith binary outputs, let us preserve all its weights as computed above. 
Continuous perceptrons used for training, however, need to be replaced 
with bipolar binary neurons. This will result in a binary-valued response 
that provides the required class numbers of values 1 or 0 for classes 1 and 
2, respectively. 

Based on the result obtained, we can analyze the implemented mapping 
of input-to-image-to-output space. As a result of run 1, perceptrons 3 and 4 
implement the following mapping to image space ø, and ø 2 , respectively, 

5.545ø! - 6.094 ø 2 - 3.116 = 0 
5.467ø, - 3.178 ø 2 + 0.739 = 0 

The generated decision lines are shown in Figure 4.9(b). The reader can 
verify (hat the decision lines are obtained using the rows of computed ma- 
trix V,, and also under the assumption that net 3 and net 4 of zero value 
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determine the respective partitioning lines. In addition, perceptron 5 per- 
forms the image-to-output space mapping and the equation of the class 
decision line is as follows: 

6.898ø 3 - 6.584o 4 + 3.328 - 0 

The line is also shown in the figure. It can be seen that mapped pattems 
B,D and A,C are positioned at different sides of the decision line, which 
is generated by the output perceptron in the image space. This part of the 
example demonstrates how linearly nonseparable original pattems can be 
correctly classified by this classifier. Interestingly, these results have been 
produced as a result of error back-propagation network training. 

Inspecting the results of run 2, the reader can easily verify that the 
decision surfaces generated by perceptrons 3, 4, and 5 are as shown in 
Figure 4.9(c). Markedly, it can be seen that the results of run 2 are very 
similar to those obtained in Example 4.1 without training the classifier net- 
work. Reviewing the mappings from Figures 4.9(b) and (c), note that they 
both merge pattems of the same class to the same compartment in the image 
space. For the decision lines as shown and an activation function with Å = 1 
used for training the network, the output values close to 0 or 1 need to be 
thresholded to values of 0 or 1 to yield binary classification response. As 
an alternative, A can be increased and outputs, in limit, will approach the 
0, 1 levels. ■■ 


Training Errors 

Every supervised training algorithm covered so far in the text, including 
the error back-propagation training algorithm, involves the reduction of an error 
value. For the purpose of weight adjustment in a single training step, the error 
to be reduced is usually that computed only for a pattem currently applied at the 
input of the network. For the purpose of assessing the quality and success of 
the training, however, the joint error must be computed for the entire batch of 
training pattems. The discussion below addresses the main definitions of errors 
used for evaluation of neural network training. 

Note that the cumulative error is computed over the error back-propagation 
training cycle (Step 3, EBPTA), and it is expressed as a quadratic error 

E= l I l(d pk ~o pk ) 2 (4.34) 

2p=\ k =\ 

This error is a sum of P errors computed for single pattems using formula (4.4). 
It can be seen that the error of Eq. (4.34) depicts the accuracy of the neural 
network mapping after a number of training cycles have been implemented. Such 
definition of error, however, is not very useful for comparison of networks with 
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different nujnbers of training pattems P and having a different number of output 
neurons. Networks with the same number of outputs K when trained using large 
numbers of | pattems in the training set will usually produce large cumulative 
errors (4.34) due to the large number of terms in the sum. For similar reasons, 
networks with large K trained using the same training set would usually also 
produce larie cumulative errors. Thus, a more adequate error measure can be 
introduced ^s in (4.35): 

( 4 . 35 ) 

The value E ^ has the sense of a root-mean-square normalized error ; and it 
seems to be more descriptive than E as in (4.34) when comparing the outcome 
of the trainipg of different neural networks among each other. 

In somp applications, the networks’ continuous responses are of significance 
and thus any of the discussed error measures E and £ ms bear useful informa- 
tion. The dfcgree of association or the accuracy of mapping can be measured 
by these continuous error measures. In other applications, however, the neurons’ 
responses are assigned binary values after the thresholding. These applications 
include classifier networks. For example, all unipolar neurons responding below 
0.1 and aboVe 0.9 can be considered as approximating binary responses 0 and 1, 
respectively 

Assumgng that the network is trained as a classifier, usually all the desired 
output values can be set to zero except for the one corresponding to the class the 
input pattem is from. That desired output value is set to 1. In such cases, the de- 
cision error , rather than the continuous response errors, more adequately reflects 
the accuracy of neural network classifiers. The decision error can be defined as 

E d = ( 4 . 36 ) 

where iV err is the total number of bit errors resulting at K thresholded outputs 
over the cornplete training cycle. Note that networks in classification applications 
may perforrh as excellent classifiers and exhibit zero decision errors while still 
yielding substantial E and £ ms errors. 


Multilayer Feedforward Networks as 
Universal Approximators 

| 

Although classification is a very important form of neural computation, 
the binary rpsponse only of neural networks would seriously limit their mapping 
potential. Our focus in this section is to study the performance of multilayer feed- 
forward netjworks as universal approximators. As we discussed early in Section 
2.4, the problem of finding an approximation of a multivariable function h(x) can 
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Figure 4.10 Approximation of h(x) with staircase function H(w,x). 


be approached through supervised training of an input-output mapping from a set 
of examples. The leaming proceeds as a sequence of iterative weight adjustments 
until the solution weight vector w* is found that satisfies the minimum distance 
criterion (2.26). 

An example function h(x ) to be approximated is illustrated in Figure 4.10. 
Assume that P samples of the function are known on a set of arguments {x l9 
jc 2 , . . . , x P }. The samples are simply examples of function values in the interval 
(a,b). It is assumed for simplicity that the example arguments are uniformly 
distributed between a and b and are Ax apart, i.e. 


^ ^ 

x i+i - x,- = Ax = — p— . for i = 1, 2, . . . , P 

We note that the interval ( a,b ) is divided into P equal intervals of length Ax 
defined as 



Ax Ax\ 

T' * + Ti ’ 


for i = 1, 2, . . . , P 




where 
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The functiorl values h t = h(x t ) at argument values determine the height of each 
of the P rectangles depicted on the figure. This way we obtain a staircase approx- 
imation H(\i(,x) of the continuous-valued function h(x). Let us define a function 
£(x) such th it 

1 1 ( 0 for x < 0 

£(x) = - sgnx + - = < undefined for jr = 0 (4.37) 

U for x > 0 

The function £(x) is called the unit step function. The staircase approximation 
can be exprøssed using the unit step functions as follows 


H( w,x) = h x £ ( x - x x + — \ - { Ix-Xi 


+ h P K ( x — x P + 


- Hx-xp- 


or, briefly 

H(w,x) = X h t {(x- Xi + ^ x i ~ y) ( 4 - 38b ) 

We note that each term in brackets in (4.38b) is a unity height window of width 
Ax centered at x, as shown by the continuous line in Figure 4.11(a). This window 
expressed ip terms of the sgn (•) function using (4.37) can be rewritten as 

Ax\ 1 ( Ax 

x ~ x '~y) = 2 ss "[. x -*‘ + T 

- \ W - J=1 - 


/ I Ax 
C (x - -V, + - 



Figure 4.11(b) illustrates how to implement Equation (4.39) by using two TLU 
elements wi th appropriate thresholds and summing their properly weighted re- 
sponses. It can thus be seen that two binary perceptrons are needed to produce 
a single window. We may notice that a network with 2 P individual TLU ele- 
ments similir to the one shown in Figure 4.11(b) can be used to implement the 
staircase approximation (4.38). An example of such a network is illustrated in 
Figure 2.9(a). 

Note that the network of Figure 4.11(b) has two constant inputs x ( and -1. 
Merging them would result in a functionally identical network, which is shown 
in Figure 4J2(a). Binary response units TLU #1 and TLU #2 are now replaced 
with the bipolar continuous perceptrons as shown in Figure 4.12(b). As a result, 
the mapping of input x into output o takes the form of a single bump centered at 
Xj rather thap of the rectangular window. The bump is shown in Figure 4.11(a) by 
the dashed line. Let us also note that with the increasing steepness factor A, the 
dashed line is getting doser to the continuous one, and the bump approaches the 
rectangular window. In faet, even the inerease of A is not necessary because it 
can be replaced with scaling the weight up by using the multiplicative factor A 



for the weights shown in Figure 4.12(b). In two or more dimensions, we could 
add one function producing a bump for each dimension. Figure 2.9(c) provides 
here an appropriate illustration. We could also combine peaks as in (4.38) for a 
staircase approximation of a multivariable function H( w, x). Although the bump 
approach may not be the best for a particular problem, it is intended only as 
proof of existence. 

The preliminary and rather nonrigorous considerations above indicate that a 
sufficient number of continuous neurons can implement a finite sum of localized 
bumps spread over the multidimensional domain of the argument space x. The 
construction presented indicates that multivariable functions h(\) can be modeled 
by two-layer networks. Whether or not the approximations are leamable and the 
approximating weights w* can be found by training remains an open question. 
We also do not know exactly what may be the best network architectures and 
whether or not the number of 2 P units in the first layer is adequate or if the 
number is excessive. 

In a more formal approach, networks of the type shown in Figure 4.12 
can be used to map R n into R by using P examples of the function h(x) to be 







(a) (b) 

Figure 4.12 Networks implementing the window function of Eq. (4.39): (a) with binary term 
(4.39) and (b) yvith continuous neurons [dashed line in Figure 4.11(b)]. 


approximated by performing the nonlinear mapping with continuous neurons in 
the first layer as folio ws 

y = T[Vx] (4.40a) 

and by computing the linear combination by the single node of the output layer 

o = H(x) = w'y (4.40b) 

Although the concept of nonlinear mapping (4.40a) followed by linear 
mapping (4.40b) pervasively demonstrates the approximating potential of neural 
networks, most theoretical and practical technical reports deal with the second 
layer also providing the nonlinear mapping (Poggio and Girosi 1990; Funanashi 
1989; Homix, Stinchcombe, and White 1989). 

The general network architecture performing the nested nonlinear scheme 
was shown earlier in Figure 4.7. It consists of a single hidden layer and it 
implements K mappings, each mapping being the component of o* 

o* = r[wrtvo]] (4.4i) 

This standard class of neural networks architecture can approximate virtually 
any multivariable function of interest, provided sufficiently many hidden neurons 
are availablé. The studies of Funanashi (1989), Homik, Stinchcombe, and White 
(1989) prové that multilayer feedforward networks perform as a class of universal 
approximatoirs. The results also provide a fundamental basis for establishing the 
ability of multilayer feedforward networks to leam the connection strengths that 
achieve the desired accuracy of the approximation. The reader is referred to the 
specialized Uterature for more details (Homik, Stinchcombe, and White 1989; 
Funanashi 1989; Poggio and Girosi 1990). 

Failures in approximation application can, in general, be attributed to inade- 
quate leaming. The reasons for failure quoted in the literature are an inadequate 
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(too low or too high) number of hidden neurons, or the lack of a determin- 
istic relationship between the input and target output values used for training. 
If functions to be approximated are not bounded and h(\) cannot be properly 
scaled, the use of the linear mapping by the second layer as in (4.40b) may 
offer the solution. Rather than assuming the ordinary ‘signoidal’ activation func- 
tions, simple identity activation functions of the output neurons can be used in 
such cases. 

The function values that need to be used as desired output values during 
training of the network from Figure 4.7 with neurons’ response bounded must 
also be bounded. The range of function values is (—1,1) for bipolar continu- 
ous activation functions and (0, 1) for unipolar continuous activation functions. 
Several practical demonstrations of neural network approximation capability are 
presented in Chapter 8 in relation to their applications to robotics and control 
systems. The example below discusses the main features of the function approx- 
imation issues. 


EXAMPLE 4.3 


In this example we will review an application of a three-neuron network for 
approximation of a two-variable function. The network to be trained has the 
architecture shown in Figure 4.9(a). Bipolar continuous activation functions 
are used for the network training. Weight matrices W and V are defined as 
in Example 4.2. 

Let us attempt to train the network to compute the length of a planar 
vector with components o l and o 2 . This yields the desired value for o 5 as 
folio ws: 

d = yjo\ + o\ (4.42) 

Using (4.41) we can write 

o 5 = r[WT[Vo]] (4.43) 

where 

o = [-1 o l o 2 ]' 

The weights w k j and v yY need to be adjusted during training. The input 
domain for training has been chosen to be the first quadrant of the plane 
o x ,o 2 with 0 < Oi < 0.7, for i = 1, 2. 

In the first experiment only 10 training points have been selected. They 
have been uniformly spread only in the lower half of the first quadrant. 
The network has reached an acceptable error level established at 0.01 after 
2080 training steps with rj = 0.2. Matrices W and V obtained from the 
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simulation are 



'0.03' 

w = 

3.66 


_2.73_ 

V — 

-1.29 

V — 

0.97 


(4.44) 

-3.04 -1.54' 

2.61 0.52 


To check how the network has leamed the examples from the training 
set, the complete surface constructed by the trained network has been gener- 
ated. The magnitude of the mapping error computed as \d — o 5 \ is illustrated 
in Figur^ 4.13(a). Magnitudes of error associated with each training pattem 
are also shown on the surface. 

The results produced by the network indicate that the mapping of the 
function has been leamed much better within the training domain than out- 
side it. thus, any generalization provided by the trained network beyond the 
training domain remains questionable. This is particularly vivid for the left 
upper cqmer of the quadrant where the discrepancy between the accurate 
function value and its approximation reaches 0.45. Also, pattems that are 
clustered in the training domain are mapped more accurately. Therefore, the 
recall of the training pattems that are at the horders of the training area does 
not yield as low an error as recalled pattems, which are located inside it. 

Figure 4.13(b) provides more detailed insight into the mapping property 
of this network. It displays the error d — o 5 in a form of the contour map. 
The map indicates that an ideal approximation contour for d = o 5 would 
be locatdd somewhere between the contours —0.02 and 0.01. Note that the 
10 training points are marked as circles on the graph. 

The training experiment was then repeated for the same architecture but 
with 64 training points now densely covering the entire quadrant domain. 
The network has reached an error level of value 0.02 after 1200 training 
steps with 7] = 0.4. The weights obtained as a result of training are 


W = [ —3.47 -1.8 2.07] 

[-2.54 -3.68 0.61] (4-45) 

V [ 2.76 0.07 3 - 83 . 


The mapping performed by the network is shown in Figure 4.14(a). This 
figure shows the result of the surface reconstruction and has been obtained 
as a serifes of numerous recalls for points o x and o 2 densely covering the 
first quadrant, which is our domain of interest. Figure 4.14(b) provides the 
detailed dontour error map of the error d — o 5 for the designed network. The 
training joints for this experiment have been selected at the intersection of 
the meshj. The mapping is reasonably accurate in the entire domain; however, 
it tends to get worse for points approaching the boundaries of the training 
domain. 
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As the last in a series of computational experiments, the architecture 
of the network has been changed to contain 10 hidden neurons (J = 10). 
The same set of 64 training points has been used. The network has reached 
a desired error level of value 0.015 after 1418 training steps with 77 = 0.4. 
Weights obtained are 

W 4 = [ -2.22 -0.30 -0.30 -0.47 1.49 -0.23 1.85 

-2.07 -0.24 0.79 -0.15] (4.46) 

0|.57 0.66 -0.10 -0.53 0.14 1.06 -0.64 -3.51 -0.03 0.01 
0 t 64 -0.57 — 1.13 -0.11 -0.12 -0.51 2.94 0.11 -0.58 -0.89 


V = 


The mapping obtained in this case is displayed in Figure 4.15(a), and the 
respective error contour map is provided in Figure 4.15(b). The overall qual- 
ity of the mapping is comparable to that shown in Figure 4.14(b). However, 
the network training required much more CPU time than for the case of 
two neurons only in the hidden layer. H 


learniNg factors 

The back-propagation leaming algorithm in which synaptic strengths are 
systematically modified so that the response of the network increasingly ap- 
proximates the desired response can be interpreted as an optimization problem. 
The generic criterion function optimization algorithm is simply negative gradient 
descent with a fixed step size. The output error function (4.4), which serves as 
an objective 1 ‘unction, is defined in the overall weight space, which has /(/ + K ) 
dimensions. The leaming algorithm modifies the weight matrices so that the error 
value decreases. 

The essence of the error back-propagation algorithm is the evaluation of the 
contribution c f each particular weight to the output error. This is often referred to 
as the problem of credit assignment. Since the objective function of a neural net- 
work contains continuously differentiable funetions of the weights, the evaluation 
of credit assignment can be easily accomplished numerically. As a reminder, note 
that this evaluation would not have been possible without replacing the discrete 
perceptrons with continuous perceptrons. 

It might appear that the error back-propagation algorithm has made a break- 
through in supervised leaming of layered neural networks. In practice, how- 
ever, implementation of the algorithm may encounter different difficulties. The 
difficulties are typical of those arising in other multidimensional optimization 
approaches. One of the problems is that the error minimization procedure may 
produce only a local minimum of the error function. Figure 4.16 shows a typical 
cross section of an error space in a single weight dimension. It can be seen that 
the error is u nonnegative function of the weight variable. The ideal mapping 


would reducel 


£ra, s to zero. 
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In practice, however, the leaming would be considered successful for £’ rms 
below an acceptable minimum £ ms value. The error function shown in Figure 
4.16 possesses one global minimum below the min £ ms value, but it also has two 
local minima at w n and w /2 , and one stationary point at w s . The leaming proce- 
dure will stop prematurely if it starts at point 2 or 3; thus the trained network 
will be unable to produce the desired performance in terms of its acceptable 
terminal error. To ensure convergence to a satisfactory minimum the starting 
point should be changed to 1. Moreover, there is a question of how long it might 
take a network to leam. An appropriate choice of the leaming parameters should 
guarantee that a good quality solution is found within a reasonable period of 
computing time. 

Although the negative gradient descent scheme and all other optimization 
techniques can become stuck in local minima of the error function, local min- 
ima have not been much of a problem in many of the training cases studied. 
Since these minima are not very deep, inserting some form of randomness to the 
training may be sufficient to get out. 

However, a more convincing explanation for reasons why the local minima 
are not a major problem in this training procedure has its background in the 
stochastic nature of the algorithm. The square error surfaces produced are ran- 
dom. The larger the neural network, the better should be the training outcome. 
In faet, the error back-propagation technique has been found to be equivalent to 
a form of stochastic approximation explored in the early 1960s (Tsypkin 1973). 
The major novel aspect of the algorithm is that it is computationally efficient for 
the empirical study of stochastic approximation techniques (White 1989). 

One of the factors that usually improves the convergence of training is the 
statistical nature of inputs and outputs, which may be realizations of two some- 
what related random processes. Also, even when inputs are purely deterministic, 
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superposition of noise with the zero mean value can increase the efficiency of 
the training process. However, in absence of solid theories, the neural network 
modeler shoi ld often rely on experimentation and on under standing of basic 
network princfiples to achieve satisfactory results. 

The discussion below continues to address the issue of convergence of the 
back-propagation algorithm. We will also focus on several main aspects and 
some practical properties of the algorithm. The most important of these are ini- 
tial weights, kumulative weight adjustment, the form of the neuron’ s activation 
function, and selection of the leaming constant and momentum term. We will 
also discuss selected aspects of the network architecture that are relevant for 
successful training. 


Initial Woights 

The weights of the network to be trained are typically initialized at small 
random valueis. The initialization strongly affects the ultimate solution. If all 
weights start out with equal weight values, and if the solution requires that 
unequal weights be developed, the network may not train properly. Unless the 
network is disturbed by random factors or the random character of input pattems 
during training, the intemal representation may continuously result in symmetric 
weights. 

Also, the network may fail to leam the set of training examples with the 


error stabilizihg or even increasing as the leaming continues. In faet, many em- 
pirical studies of the algorithm point out that continuing training beyond a certain 
low-error plateau results in the undesirable drift of weights. This causes the error 
to increase ahd the quality of mapping implemented by the network decreases. 
To counteract the drift problem, network leaming should be restarted with other 


random weights. The choice of initial weights is, however, only one of several 
factors affeeting the training of the network toward an acceptable error minimum. 


Cumulatjve Weight Adjustment versus 
Incremerttal Updating 


As stated before, the error back-propagation leaming based on the single 
pattem error reduction (4.4) requires a small adjustment of weights which fol- 
lows each præsentation of the training pattem. This scheme is called ineremental 
updating. As jshown by McClelland and Rumelhart (1986), the back-propagation 
leaming also j implements the gradient-like descent minimization of the overall 
error function as defined in (4.34) computed over the complete cycle of P pre- 
sentations, provided the leaming constant 7j is sufficiently small. 
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The advantage of minimization of the cuirent and single pattem error as 
illustrated in the flowchart of Figure 4.8 and summarized in the EBPT algorithm 
of Section 4.4 is that the algorithm implements a true gradient descent downhill 
of the error surface. Moreover, during the computer simulation, the weight ad- 
justments determined by the algorithm do not need to be stored and gradually 
compounded over the leaming cycle consisting of P joint error signal and weight 
adjustment computation steps. The network trained this way, however, may be 
skewed toward the most recent pattems in the cycle. To counteract this specific 
problem, either a small leaming constant r) should be used or cumulative weight 
changes imposed as follows 

p 

Aw = X Ah’ 
i 

for both output and hidden layers. The weight adjustment in this scheme is 
implemented at the conclusion of the complete leaming cycle. This may also 
have an averaging effect on the training, however. Provided that the leaming 
constant is small enough, the cumulative weight adjustment procedure can still 
implement the algorithm close to the gradient descent minimization. 

Although both cumulative weight adjustment after each completed training 
cycle or incremental weight adjustment after each single pattem presentation can 
bring satisfactory solutions, attention should be paid to the faet that the training 
works best under random conditions. It would thus seem advisable to use the 
incremental weight updating after each pattem presentation, but choose pattems 
in a random sequence from a training set. This introduces mueh-needed noise 
into the training and alleviates the problems of averaging and skewed weights 
which would tend to favor the most recent training pattems. 


Steepness of the Activation Function 


As introduced in Chapter 2, the neuron’s continuous activation function 
f(net, A) is characterized by its steepness factor A. Also, the derivative f'(net) of 
the activation function serves as a multiplying factor in building components of 
the error signal vectors 8 0 and 8 r Thus, both the choice and shape of the 
activation function would strongly affeet the speed of network leaming. 

The derivative of the activation function (2.3a) can be easily computed as 
follows: 


f'(net) 


2A exp(— A net) 

[l + exp(— \net)] 2 


(4.47) 


and it reaches a maximum value of Vik at net = 0. Figure 4.17 shows the 
slope function of the activation function and it illustrates how the steepness A 
affeets the leaming process. Since weights are adjusted in proportion to the value 
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Figure 4.17 Slope of the activation function for various A values. 

f'(net), the weights that are connected to units responding in their midrange are 
changed the most. The weights of uncommitted neurons with uncertain responses 
are thus affected more strongly than of those neurons that are already heavily 
tumed on or tumed off. Since the local signal errors 8 ok and S y j are computed 
with f'(net) as a multiplier, the transmitted components of the back-propagating 
error are large only for neurons in steep thresholding mode. 

The other feature apparent from Figure 4.17 is that for a fixed leaming con- 
stant all adjustments of weights are in proportion to the steepness coefficient A. 
This particular observation leads to the conclusion that using activation functions 
with large A may yield results similar as in the case of large leaming constant 
rj. It thus seems advisable to keep A at a standard value of 1, and to control the 
leaming speed using solely the leaming constant rj, rather than controlling both 
rj and A. 


Leaming Constant 

The effectiveness and convergence of the error back-propagation leaming 
algorithm depend significantly on the value of the leaming constant r j. In gen- 
eral, however, the optimum value of tj depends on the problem being solved, 
and there is no single leaming constant value suitable for different training cases. 
This problem seems to be common for all gradient-based optimization schemes. 
While gradient descent can be an efficient method for obtaining the weight values 
that minimize an error, error surfaces frequently possess properties that make the 
procedure slow to converge. 

When broad minima yield small gradient values, then a larger value of rj will 
result in a more rapid convergence. However, for problems with steep and narrow 
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minima, a small value of 77 must be chosen to avoid overshooting the solution. 
This leads to the conclusion that 77 should indeed be chosen experimentally 
for each problem. One should also remember that only small leaming constants 
guarantee a true gradient descent. The price of this guarantee is an increased 
total number of leaming steps that need to be made to reach the satisfactory 
solution. It is also desirable to monitor the progress of leaming so that 77 can be 
increased at appropriate stages of training to speed up the minimum seeking. 

Although the choice of the leaming constant depends strongly on the class 
of the leaming problem and on the network architecture, the values ranging from 
1 (T 3 to 10 have been reported throughout the technical literature as successful for 
many computational back-propagation experiments. For large leaming constants, 
the leaming speed can be drastically increased; however, the leaming may not be 
exact, with tendencies to overshoot, or it may never stabilize at any minimum. 

Even though the simple gradient descent can be efficient, there are situa- 
tions when moving the weights within a single leaming step along the negative 
gradient vector by a fixed proportion will yield a minor reduction of error. For 
flat error surfaces for instance, too many steps may be required to compensate 
for the small gradient value. Furthermore, the error contours may not be circular 
and the gradient vector may not point toward the minimum. Some heuristics for 
improving the rate of convergence are proposed below based on the observations 
just discussed. 


Momentum Method 

The purpose of the momentum method is to accelerate the convergence of 
the error back-propagation leaming algorithm. The method involves supplement- 
ing the current weight adjustments (4.5) and (4.21a) with a fraction of the most 
recent weight adjustment. This is usually done according to the formula 

A w(t) = -t ]VE(t) 4- aA w(t - 1) (4.48a) 

where the arguments t and t — 1 are used to indicate the current and the most 
recent training step, respectively, and a is a user-selected positive momentum con- 
stant. The second term, indicating a scaled most recent adjustment of weights, is 
called the momentum term. For the total of N steps using the momentum method, 
the current weight change can be expressed as 

Aw(r) = -77 I cfVEit - n) (4.48b) 

n = 0 

Typically, a is chosen between 0.1 and 0.8. Figure 4.18 illustrates the momentum 
method heuristics and provides the justification for its use. 

Let us initiåte the gradient descent procedure at point A'. The consecutive 
derivatives SE/ dw x and dE/dw 2 at training points A\ A", ..., are of the same 
sign. Obviously, combining the gradient components of several adjacent steps 
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Figure 4.18 Illustration of adding the momentum term in error back-propagation training for a 
two-dimensional case. 


would result in convergence speed-up. If the error surface has a smooth variation 
along a certain axis, the leaming rate along this particular component should thus 
be increased. By adding the momentum term, the weight adjustment in A" is 
enhanced by the fraction of the adjustment of weights at A f . 

After starting the gradient descent procedure at B\ the two derivatives 
dE/dw x and dE/dw 2 , initially negative at B\ both alter their signs at B" . The 
figure indicates that the negative gradient does not provide an efficient direction 
of weight adjustment because the desired displacement from B” should be more 
toward the minimum Af, or move the weight vector along the valley rather than 
across it. As seen from the figure, the displacement r]VE(t+ 1) at B n would move 
the weights by a rather large magnitude almost across the valley and near the 
starting point B'. Moving the weights by -7jV£(f+l) + aAw(Y), however, reduces 
the magnitude of weight adjustment and points the resulting vector A w(t 4-1) 
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more along the valley. One can thus infer that if the gradient components change 
signs in two consecutive iterations, the leaming rate along this axis should be 
decreased. This discussion thus indicates that the momentum term typically helps 
to speed up convergence, and to achieve an efficient and more reliable leaming 
profile (Jacobs 1988). 

To gain some further insight into the momentum method, let us look at the 
comparison of the performance of the error back-propagation technique without 
and with the momentum term. The comparative study of both techniques has been 
carried out using the scalar valued function of two-dimensional weight vector as 
folio ws: 

E( w) = (aw x ) 2 + w\ 

By varying a in the error expression above, it is possible to shape the error sur- 
face. Values of a that are close to unity ensure circular error contours in contrast 
to small values, which produce a narrow valley. For a = 0.02, different angles 
of rotation of the valley axis relative to the w x axis can be generated through a 
simple rotational transform. The basic error back-propagation technique without 
and with the momentum method has been used to seek the minimum of E( w) 
with the final error below 10” 3 and for initial weights of (100,2). The following 
table summarizes the number of computational steps for different values of 17 
and a (Silva and Almeida 1990). 


Rotation 

0° 10° 20° 30° 45° 

EBPT, tj = 0.5 

10367 

EBPT with momentum 

5180 

a = 0.5 

tj = 0.5 

a = 0.9 

tj = 0.5 

1007 

a — 0.9 

tj = 0.05 

10339 


We see that a twofold and even tenfold increase of training speed has been 
observed. The result shows that the inclusion of the momentum term can consid- 
erably speed up convergence when comparable 17 and a are employed compared 
to the regular error back-propagation technique. Although this discussion has 
only used w as symbol of weight adjusted using the momentum method, both 
weights Vjf and w k j can be adapted using this method. Thus, the momentum term 
technique can be recommended for problems with convergence that occur too 
slowly or for cases when leaming is difficult to achieve. 
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NetWork Architectures Versus 
Data Representation 

One of the most important attributes of a layered neural network design is 
choosing the architecture. In this section we will study networks with a single 
hidden layer of J neurons, and with an output layer consisting of K neurons. In 
accordance with Figure 4.7, the network has / input nodes. Let us try to deter- 
mine the guidelines for selection of network sizes expressed through /, J , and K. 
The number of input nodes is simply determined by the dimension, or size, of 
the input vector to be classified, generalized or associated with a eertain output 
quantity. The input vector size usually corresponds to the number of distinet 
features of thé input pattem. In this discussion the input vector is considered as 
nonaugmented. 

In the case of planar images, the size of the input vector is sometimes made 
equal to the total number of pixels in the evaluated image. In another approach, 
however, it may be a vector of size two only. In such cases, a trade-off usually 
exists between the number of training pattems P needed to create the training 
set and the dimension I of the input pattem vector. To illustrate this with an 
example, assume three planar training images, characters C, /, and T, represented 
on a 3 X 3 grid. Using the first of the two approaches, nine pixels can be used 
to depict each of the characters as shown in Figure 4.19. On the figure, pixels 
of character C are represented using this approach. Assuming that the size of 
the input pattérn vector is nine, we have P = 3 and the following three training 
vectors in the training pattem set {x 1 ,x 2 ,x 3 }: 

x, = [1 1 1 1 0 0 1 1 1]' : class C 

x 2 = [0 1 0 0 1 0 0 1 0] f : class I 

x 3 = [1 1 1 0 1 0 0 1 0]': class T 



n = 9 n= 2 

P-3 P=9 


Figure 4.19 Illustration for trade-off between the size of the pattem vector and the number of 
necessary training pattems. 
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assuming that white and black pixels correspond to vector components 0, 1, 
respectively. 

Altematively, the dimensionality of the planar input images can be pre- 
served as equal to only two. In such a case, however, the number of training 
pattems P has to be made equal to the number of pixels. Pixels of character 
I on the figure are labeled according to this method of pattem representation. 
It can now be seen that we have P = 9. Without any feature extraction or 
compression of original training data, the training pattem set can be written now 
as {x l9 x 2 , x 9 }, where 

Xj = [ 1 1 ]* : class C,T 

x 2 = [ 1 : class C,I,T 

x 9 = [ 3 3 ] 1 : class C 

The reader may also notice that relaxation of the requirements for dimensionality 
of input pattems leads to further interesting conclusions. Why not, for instance, 
represent the three pattems from Figure 4.19 as a set of P = 27 pattems, each 
of them considered to be of single dimension? We now have arrived at three 
different input representations, each of them being as sensible as the other for 
this simple problem. Lacking the guidelines regarding the choices and trade-offs 
between P and n , we may have to approach the problem of input representation 
with flexibility and be ready to experiment. 

As discussed in Chapter 3, ^-dimensional pattem vectors can be håndled by 
the network with / = n + 1 input nodes, where the augmentation from n to n + 1 
input nodes is required for single- and multilayer feedforward networks trained in 
a supervised mode. Thus, the choice of the input pattem dimensionality uniquely 
imposes the size of the input layer of the network to be designed. Understand- 
ably, the choice of input representation has rather significant consequences on 
the properties of the input pattems. As will be shown later, pattems that are 
difficult to classify in low-dimensional space often become easily separable after 
dimensionality expansion. 

Let us look at conditions for selecting the number of output neurons. The 
number of neurons in the output layer, K , can be made equal to the dimension 
of vectors to be associated if the network is employed as an associator of input 
to output vectors. If the network works as an auto-associator, which associates 
the distorted input vector with the undistorted class prototype, then we obviously 
have 7 = K. In the case of a network functioning as a classifier, K can be made 
equal to the number of classes. In such cases, described earlier in Section 3.7 as 
local representation, the network would also perform as a class decoder, which 
responds with only one output value different than all the remaining outputs. The 
network can be trained in this output representation to indicate the class number 
as equal to the active output number. 
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The nurhber of output neurons K can sometimes be lowered if no class 
decoding is required. Binary-coded or other encoded class numbers can be pos- 
tulated for such a classifier for the purpose of its training. This is called the 
distributed representation case as described in Section 3.7. For example, a four- 
class classifier can be trained using only two output neurons having outputs, after 
thresholding, of 00, 01, 10, and 11, for classes 0, 1, 2, and 3, respectively. For 
the same reason, alphabet character classifiers would need at least five neurons 
in the output layer since there are 26 classes to be identified. 

We can easily note that the number of output neurons in a ÅT-class classifier 
can be any integer value from log 2 A r through K , including these boundaries. 
These numbers correspond to the local and distributed representation of classi- 
fiers’ output data, respectively. It is, however, somewhat likely and intuitively 
plausible that shrinking the network and the compression of the output layer 
below the number of K neurons will affect the length of the training itself, and 
the robustness of the final network. 


Necessafy Number of Hidden Neurons 

The size: of a hidden layer is one of the most important considerations when 
solving actual problems using multilayer feedforward networks. The problem of 
the size choicjs is under intensive study with no conclusive answers available thus 
far for many tasks. The exact analysis of the issue is rather difficult because of 
the complexity of the network mapping and due to the nondeterministic nature of 
many successfully completed training procedures. In this section we will look 
at some guidelines that may assist a neural network modeler with a number of 
useful hints. 

Single-hidden layer networks can form arbitrary decision regions in n- 
dimensional input pattem space. There exist certain useful solutions as to the 
number J of hidden neurons needed for the network to perform properly. The 
solutions also determine the lower bound on the number of different pattems P 
required in the training set. As will be shown, the number of hidden neurons 
depends on the dimension ri of the input vector and on the number of separable 
regions in ^-dimensional Euclidean input space. 

Let us assume that the ^-dimensional nonaugmented input space is linearly 
separable into M disjoint regions with boundaries being parts of hyperplanes. 
Each of the M regions in the input space can be labeled as belonging to one 
of the R classes, where R < M. Figure 4.20 shows an example separation for 
n = 2, M = 7, and R = 3. Intuitively, it is obvious that the number of separa- 
ble regions M indicates the lower bound on the size P of the set required for 
meaningful training of the network to be designed. Thus, we must have P > M. 

Choosing P = M would indicate the coarsest possible separation of input 
space into M regions using parts of partitioning hyperplanes. Clearly, in such a 
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| | Class 1 

o-* 

|~F ~h ~P] Class 3 


Figure 4.20 Two-dimensional input space with seven separable regions assigned to one of 
three classes. 



case when one training pattem per region is available, granularity of the pro- 
posed classification is not fine at all. The more curved the contours of separated 
decision regions and involved pattem Space partitioning are, the higher should be 
the ratio of P/M > 1. Choosing P /M >> 1 would hopefully allow the network 
to discriminate pattem classes using fine piecewise hyperplane partitioning. 

There exists a relationship between Af, /, and n allowing for calculation 
of one of the three parameters, given the other two of them. As shown by 
Mirchandini and Cao (1989), the maximum number of regions linearly separable 
using J hidden neurons in n-dimensional input space is given by the relationship 


M(J, n) 



where 



= O for J < k 


(4.49a) 


Using formula (4.49a) allows the estimation of the hidden layer size J given n 
and M. A simple computation shows that the network providing the classifica- 
tion illustrated in Figure 4.20 should have three hidden layer neurons provided 
input pattems are of dimensionality two, or n = 2. Let us consider the case 
of input pattems of large dimension assuming that the expected, or estimated, 
size of the hidden nodes is small. For large-size input vectors compared to the 
number of hidden nodes, or when rø >: /, we have from (4.49a) 


M = 





= 2 ' 


(4.49b) 


It follows from (4.49b) that the hidden neuron layer with three nodes would be 
capable of providing classification into up to eight classes; but since n> J, the 
size of the input vector has to be larger than three. 

The formulas (4.49) can be inverted to find out how many hidden layer neu- 
rons J need to be used to achieve classification into M classes in w-dimensional 
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pattem space. This number constitutes the solution of the equation 

M 

2! 

, , /(/- 1)- ■•(/-« + 1) f 

I + + j , for J 

3! ni 

For the case \J < n we have for J from (4.49b) simply 

J = log 2 M 


EXAMPLE 4.4 


In this example we will use the discussed guidelines to propose suitable 
network; architectures for the solution of the two-dimensional XOR problem. 
Assuming / < n and n = 2, we obtain from (4.50b) that M — 4 for J - 2. 

Thé corresponding partitioning of the two-dimensional input space is 
shown iki Figure 4.2 l(a) along with an example network implementation 
using a single hidden layer architecture. The reader can verify that the net- 
work provides an appropriate classification by replacing continuous neurons 
with TLU elements and analyzing the mapping performed by each unit. 
Altematively, the same two-dimensional XOR problem can be considered 
in three-dimensional space. It is possible to add the third dimension x 3 
indicatiiig, if equal to 1, the conjunction of both inputs x x and x 2 being 1; 
otherwise, x 3 remains zero. 

Now, a single plane can partition the cube as simply as shown in 
Figure 4.21(b). It can be seen that neuron 1 performs the mapping of x x 
and x 2 to yield an appropriate x 3 value. Neuron 2 implements the decision 
plane as shown. Indeed, noting that n — 3 and M = 2, the formula (4.50b) 
yields J = 1. Thus, a single hidden layer unit as shown in Figure 4.2 l(b) 
can be Used to solve altematively the XOR problem using a total num- 
ber of two neurons instead of three as in Figure 4.21a. The size of the 
hidden layer has been reduced by 1, or 33%, at no cost. This example 
has shown how to select the size of the hidden layer to solve a specific 
classification problem. wm 

Another way of optimizing the architecture is related to pruning of feed- 
forward multilayer network . Pruning is done as network trimming within the 
assumed initial architecture. The network can be trimmed by removal of unim- 
portant weights. This can be accomplished by estimating the sensitivity of the 
total error to the exclusion of each weight in the network (Kamin 1990)' The 
weights which are insensitive to error changes can be discarded after each step 
of incremental training. Unimportant neurons can also be removed (Sietsma and 
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Dow 1988). In either case it is advisable to retrain the network with the modified 
architecture. The trimmed network is of smaller size and is likely to train faster 
than before its trimming. 


4.6 


CLASSIFYING AND EXPERT 
LAYERED NETWORKS 

Multilayer feedforward neural networks can be successfully applied to a 
variety of classification and recognition problems. The application of such net- 
works represents a major change in the traditional approach to problem solution. 
It is not necéssary to know a formal mathematical model of the classification or 
recognition problem to train and then recall information from the feedforward 
neural systerhs. Instead, if a sufficient training set and a suitable network archi- 
tecture are 4 ey i se d, then the error back-propagation algorithm can be used to 
adapt network parameters to obtain an acceptable solution (Cybenko 1990). The 
solution is obtained through experimentation and simulation rather than through 
rigorous and formal approach to the problem. As such, neural network compu- 
tation offers techniques that are in the middle ground, somewhere between the 
traditional engineering and the artificial intelligence approach. 

Although it is not yet clear from the technical literature what constitutes an 
adequate training pattem set and network architecture, a number of successful 
applications ranging from speech recognition to sonar signal processing have 
been reported (Sejnowski and Rosenberg 1987; Gorman and Sejnowski 1988). 
A number of technical reports describe other successful phonetic classification 
and speech epcperiments (Leung and Zue 1989; Waibel 1989; Bengio et al. 1989). 
More detailed exposition of multilayer feedforward network applications for prac- 
tical tasks is provided in Chapter 8. The discussion below is mainly to enhance 
the principiel to illustrate the main features of the training, and to discuss ex- 
ample results. 

Let us summarize the task the error back-propagation network needs to 
solve. Given are the training data in the form of P vector pairs (z p ,d p ), or the 
training set 


(Op-dp), p=\,2,...,P] 


(4.51) 


For a given pattem p the network maps z p into o p using the highly nonlinear 
operation of Equation (4.31a) as follows: 


o p = r[WT[Vz p ]] (4.52) 

The goal of the training has been to produce o p such that it replicates d p , or 
o(Zp) = d p . The quality of approximation is determined by the error 


E = X||o(W,V,z p )-d p 

p= i 


2 


(4.53) 
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Minimization of error (4.53) can be interpreted as a classical interpolation and 
estimation problem — given data, we seek parameters w and v that approximate 
data for a class of functions selected as in (4.52). 


Character Recognition Application 

Let us look at a typical application of the error back-propagation algorithm 
for handwritten character recognition, details of which are covered in Chapter 
8 (Burr 1988). An input character is first normalized so that it extends to the 
full height and width of the bar mask. The handwritten alphabet character is 
then encoded into 13 line segments arranged in a template as in Figure 4.22(a). 
The encoding takes the form of shadow projection. A shadow projection op- 
eration is defined as simultaneously projecting a point of the character into its 
three closest vertical, horizontal, and diagonal bars. After all points are projected, 
shaded encoded bars are obtained. 

Each of the 13 segments shown in Figure 4.22(b) is now represented by the 
shadow code ranging from 0 to 50. The shadow codes for the bars representing 
character S from top to bottom, left to right are: 50, 46, 4, 14, 50, 24, 14, 6, 43, 
42, and 50. The shadow codes can be understood as extracted pattem features. 
The shadow codes need to be normalized to within the 0, 1, which is the range 
for unipolar neurons used in the network. This enables the use of the activation 


l 



(a) (b) 

Figure 4.22 Thirteen-segment bar mask for encoding alphabetic Capital letters: (a) template 
and (b) encoded S character. [Adapted from Burr (1988), © IEEE; reprinted with permission.] 
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function in its region of high slope and makes the training easier and more 
efficient. The network thus has 13 real valued inputs between 0 and 1 , and 26 
outputs, onelfor each alphabetic character. The convention used has been that 
the output corresponding to a letter should be 1 if the character is detected, and 
0 otherwise. 

Thus, the network is of size / = 14 (including one fixed input) and K = 
26. The network has been tested with 12, 16, 20, and 24 hidden-layer neurons. 
Networks wiih 20 or 24 hidden-layer neurons usually trained faster with tj = 0.3 
and 2.0. The training data consisted of 104 handprinted uppercase characters, 
4 for each letter scaled and centered in the template as in Figure 4.22. After 
training, the network performed at levels of 90 to 95% accuracy for the new 104 
written samples provided to test the generalization. 

In a related experiment of handwritten digit recognition, a seven-segment 
template has been used for shadow code computation. The template has been 
obtained by removing bars 3-5 and 9-11 from the 13-bar template of Figure 
4.22(a). A similar projection method has been employed to obtain shadow codes 
for 100 handwritten numerals 0 through 9. Fifty vectors have been used to train 
the network, 50 remaining vectors were needed for test purposes. The archi- 
tecture was trained with 7 inputs and 10 output neurons. The best results have 
been reported by Burr (1988) for the 6 to 16 hidden-layer neurons for 77 = 2 
and a = 0.9 used for training. The accuracy of recognition achieved using this 
method was between 96 and 98% correct answers on the test set. 

For the experiment with letter recognition, there was a total ofl4X12+13X 
26 = 506 weights for the 12 hidden nodes of the network. We may thus consider 
that the network had 506 degrees of freedom. As observed by Cybenko (1990), 
the number of degrees of freedom in many successful applications exceeds the 
number of training samples used to train the network. This, in tum, contradicts 
the common belief that the estimation problems should be overdetermined, and 
that there should be more data points than parameters to be estimated. This, 
in particular, is the case in linear regression and other conventional data fitting 
approaches. Neural networks of the class discussed in both the example cov- 
ered as well as in other applications typically seem to solve underdetermined 
problems (Cybenko 1990). One of the possible explanations is that the true 
network dimensionality is not determined by the number of its weights, but is 
considerably smaller. 

EXAMPLE 4.5 

This example demonstrates network training using the error back-propaga- 
tion technique for the bit-map classification of a single-hidden layer net- 
work. The task is to design a network that can classify the simplified bit 
maps oi three characters as shown in Figure 4.23(a). Assuming that the 
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Figure 4.23a, b Figure for Example 4.5: (a) bit maps for classification, (b) resulting weight ma 
trices, 16 nonaugmented inputs, eight hidden layer neurons, three output neurons. 
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Figure 4.23c 

den layer sizes. 

Figure for Example 4.5 (continued): (c) learning profiles for several different hid- 

dimensi< 
the folk 

Jnality of the nonaugmented input pattem is 16, we have P = 3 for 
•wing input /desired output training data: 


Character C:z 1 = [lll — 11 — 1 — 1 — 11 — 1 -1 -1111 -1]' 
Class number: d x = [ 1 0 0] r 

Character N: z 2 = [ 1 -1 -1 1 1 1 -1 1 1 -1 1 1 1 -1 -1 1 ]' 
Class number: d 2 = [0 1 0] r 

Character Z: z 3 = [ 1 1 1 1 -1 -1 1 -1 -1 1 -1 -1 1 1 1 1]' 
Class number: d 3 = [0 0 l]* 

Thé readers may notice that black and white pixels are coded as 1 and 
— 1, respectively, and that the local representation is used for detecting the 
class membership. 

Thé program listed in the Appendix has been used to train the network 
with eight hidden neurons with tj = 1 and A = 1. Due to the necessary 
augmentation of inputs and of the hidden layer by one fixed input, the 
trained network has 17 input nodes, nine hidden neurons, and three output 
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neurons with an architecture like that of Figure 4.7. Error E has been defined 
as in (4.4) or as in the EBPT algorithm of Section 4.4. Since this is a rather 
simple classification task, the training takes only 144 iterations, which is 
an equivalent of 48 incremental training cycles consisting of three pattems 
each. The resulting weight matrices W and V are shown in Figure 4.23(b). 

The typical output vectors recalled by the trained network with input 
vectors equal to z l9 z 2 , and z 3 are, respectively, 

Oj = [0.901 0.050 0.078]' 

o 2 =[ 0.011 0.984 0.013]' (4.54) 

o 3 = [0.014 0.013 0.984]' 

Simple thresholding of the computed vector entries of (4.54) provides the 
desired binary classification. It has been observed that using a value of 0.2 
for 7] has increased the number of training steps to 722 for the same error 
E m ax = 0.01. Using values of 5 and 10 for 17 has reduced the number of 
training steps to 37 and 16, respectively. However, drastic changes of E in 
both directions for large 17 values have also been observed. Thus, to achieve 
moderately fast and reliable leaming, the leaming constant of value 17 = 1 
has been used. 

Different sizes of hidden layers have been attempted during training 
with 7 } = 1. Leaming profiles for 4, 6, 8, and 12 (nonaugmented) hidden 
nodes are shown in Figure 4.23(c). Increasing the number of hidden nodes 
has allowed for training to converge in fewer training steps. It takes longer, 
however, to train the large network. In addition, the network hardware be- 
comes more expensive. The increase in the number of hidden nodes from 8 
to 16 produced marginal reduction of the training from 144 to 106 steps. It 
has therefore been concluded that the suitable number of hidden nodes for 
this type of problem is between 4 and 8. mm 


Expert Systems Applications 

In the previous sections of this chapter, we introduced the supervised leam- 
ing techniques of layered feedforward networks. Let us again consider such 
layered networks, which respond with outputs that can be represented as variables 
assuming one of several possible values, or continuum of values. Let us note 
that considering more than two output values of a neuron, or even continuous 
outputs, would make it possible to handle uncertainties. The degree of belief in 
a certain response can be determined using such fine quantized outputs. 

The conventional approach to building an expert system requires a human 
expert to formulate the rules by which the input data can be analyzed. The 
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number of rules needed to drive an expert system may be large. Further, the 
lack of rigorous analysis, or even a lack of understanding of the knowledge 
domain, often makes formulation of such rules difficult. The rule formulation 
may become particularly complex with large sets of input data. Realizing that 
layered networks can be trained without encapsulating the knowledge into the 
rules, let us look at how they can be applied as an alternative to conventional 
rule-based expert systems. 

Neural networks for diagnosis, fault detection, predictions, pattem recog- 
nition and Association solve essentially various classification, association, and 
generalizatidn problems. Such networks can acquire knowledge without extracting 
IF-THEN rules from a human expert provided that the number of training vector 
pairs is sufficient to suitably form all decision regions. Thus, neural networks 
would be able to ease the knowledge acquisition bottleneck that is hampering the 
creation and development of conventional expert systems. After training, even 
with a data-rich situation, neural networks will have the potential to perform like 
expert systems. What would often be missing, however, in neural expert system 
performance is their explanation function. Neural network expert systems are 
typically unable to provide the user with the reasons for the decisions made. The 
applications of neural networks for large-scale expert systems will be discussed in 
more detail in Chapter 8. This section covers only basic application concepts for 
layered diagnostic networks that are trainable using the error back-propagation 
technique. We will call such networks connectionist expert systems. 

Let us take another look at the training and recall phases of error back- 
propagation-trained networks. Assume that a feedforward layered network is 
trained using the training vector pairs (z 1? dj), (z 2 , d 2 ), . . . , (z F , d P ). In the test, or 
recall, phase, the network described by Equation (4.52) performs the recognition 
task if it is tested with the input being a vector z, (1 ^ i ^ P) corrupted by 
noise. The network is expected to reproduce o, at its output in spite of the 
presence of noise. If no noise has been added, the network performs either a 
simple classification or association task. 

If the trained neural network is tested with an input substantially different 
from any of the training set members, the expected response is supposed to sol ve 
the generalisation problem. The generalization of the knowledge of the domain, 
which the network has leamed during training, should cause it to respond cor- 
rectly to any unseen before new input vector. Typical examples of generalization 
are diagnoses and prediction. The classification task may also be performed as 
a generalization operation but only in cases for which the domain models are 
partially defined. Examples of such classifications are medical and most technical 
diagnoses. 

Let us consider first how a connectionist expert system for medical diagno- 
sis can be built (Gallant 1988; Hripcsak 1988). A block diagram of an example 
expert system is shown in Figure 4.24. Input nodes take the information about the 
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patient and disease 


Figure 4.24 Connectionist expert system for diagnosis. 


selected symptoms and their parameters. These can be numerical valnes assigned 
to symptoms, test results, or a relevant medical history for the patient. In the 
simplest version, input signals may be the binary value 1, if the answer to the 
symptom question is yes, or —1, if the answer is no. The “unknown” answer 
should thus be made 0 to eliminate the effect of the missing parameter on the 
conclusion. With / input nodes, the network is capable of handling that many 
binary or numerical disease or patient data. The number of identifiable diseases 
can be made equal to the number of output nodes K. 

In conventional automated medical diagnosis, a substantial obstacle is the 
formulation and entry of many rules by an expert. In contrast to a rule-driven 
system, a connectionist expert system can leam from currently accepted diag- 
noses. Diagnoses are generated from available hospital or doctor databases. As 
disease pattems and the art of diagnosis change, the connectionist expert system 
for diagnosis needs only be retrained or trained incrementally on a new set of 
data. In the case of major changes in diagnostic techniques, however, the expert 
system’s numbers of input and hidden nodes and its architecture may also have 
to be changed. 

As mentioned before, special care needs to be taken when choosing the 
number of hidden nodes. Too low a number of hidden nodes J can cause diffi- 
culties in mapping of I inputs into K outputs; too large a value for J will increase 
unnecessarily the leaming and diagnosis times and /or cause uncertainty of the 
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(c) 

Figure 4.25 Automobile engine diagnostic data: (a) no-fault engine, (b) defective spark plug 4, 
and (c) defective fuel injector. [Source: Marko et al. (1989). © IEEE; reprinted with permission.] 

training objective. In general, for J larger than needed, weights become more 
difficult to estimate reliably from the training data. 

Let us review an example of a connectionist expert system for fault diagno- 
sis of an aiitomobile engine (Marko et al. 1989). The input and output data are 
controlled and monitored by an electronic engine control computer. Waveforms 
of analog /digital data such as those shown in Figure 4.25(a) are obtained first 
from an engine without faults, which is accelerated in neutral against the inertial 
load. These multichannel data serve as a reference and they are used for training 
of the expert system modeling the fault-free operation of an engine. Although a 
skilied engineer can analyze the complex engine output data shown in the figure, 
most technical personnel would find the interpretation of them rather difficult. 

The practical Identification of a fault becomes more complicated and time- 
consuming when signal anomalies are small and distributed over several signals 
that are time functions. This can be observed in Figures 4.25(b) and (c), which 
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show a set of input /output signals for a faulted engine. Figure 4.25(b) displays 
signals with a defective spark plug on cylinder 4, which causes a misfire at this 
cylinder in each engine cycle. Figure 4.25(c) displays the data from an engine 
with a defective fuel injector. In particular in this case, the fault can only be 
identified from simultaneous comparison of many of the traces of data. 

The task of defective engine diagnosis has been successfully solved by a 
back-propagation trained single-hidden layer network. The connectionist expert 
system for engine diagnosis makes it possible to identify 26 different faults such 
as a shorted plug, an open plug, a broken fuel injector, etc. The training set 
consists of 16 sets of data for each failure, each of the sets representing a single 
engine cycle. A total of 16 X 26 data vectors with 52 elements in each vector 
has been used for training. 

The described neural network-based expert system needs 10 minutes of train- 
ing time on the NESTOR NDS-100 computer. It attains 100% fault recognition 
accuracy on the test data set. Low leaming rates, a number of hidden units equal 
to twice the number of inputs, and a randomized presentation order within the 
training set have been used in the initial development phase. To improve the 
leaming speed, the number of hidden units was then decreased to less than the 
number of input units. As a result, the leaming time decreased five times while 
maintaining 100% accuracy on the test set. 


Leaming Time Sequences 

A number of practical applications require neural networks that respond to 
a sequence of pattems. Figure 3.2(b) exemplifies a case of a temporal pattem 
consisting of waveform samples. The neural network should produce a particular 
output in response to a particular sequence of inputs. Speech signals, measured 
waveform data, and Control signals are examples of waves that are considered 
as discrete-time sequences rather than as unordered data. 

One simple way to use the conventional training methods for mapping of 
training data sequences is to tum the temporal sequence into a spatial input pat- 
tem. When this is accomplished, we may consider that the given set of samples 
in a sequence is fed simultaneously to the network. This case is illustrated in 
Figure 4.26. It can be seen that a series connection of delay elements provides 
the solution to the problem. Inputs to the network are samples x t = x(t - i‘A), 
for i = 0, 1, . . . , n. The network with tapped delay lines is called a time -delay 
network . The figure shows a single-channel temporal sequence. We may note that 
for handling multidimensional data sequences, the series connection of delay el- 
ements and tapping must be implemented through an appropriate expansion of 
the input node layer. 
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Note: A is equal to the sampling period 


Figure 4.26 A time-delay neurai network converting a data sequence into the single data vec- 
tor (single variable sequence shown). 


— FUNCTIONAL 
LINK NETWORKS 

Earlier discussion in this chapter focused on two-layer mapping networks 
and their training. The hidden layer of neurons provides an appropriate pattem 
to image transformation, and the output layer yields the final mapping. Instead 
of carrying out a two-stage transformation, input /output mapping can also be 
achieved through an artificially augmented single-layer network. The separating 
hyperplanes generated by such a network are defined in the extended input space. 
Since the network has only one layer, the mapping can be leamed using the 
simple delta leaming rule instead of the generalized delta rule. The concept of 
training an augmented and expanded network leads to the so-called functional 
link network as introduced by Pao (1989). Functional link networks are single- 
layer neurai networks that are able to handle linearly nonseparable tasks due to 
the appropriately enhanced input representation. 

The key idea of the method is to find a suitably enhanced representation 
of the input data. Additional input data that are used in the scheme incorporate 
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Figure 4.27 Functional link network. 


higher order effects and artificially increase the dimension of the input space. 
The expanded input data are then used for training instead of the actual input 
data. The block diagram of a functional link network is shown in Figure 4.27. 
The network uses an enhanced representation of input pattems composed now of 
/ + H components. The higher order input terms applied at H bottom inputs of 
the functional link network from Figure 4.27 are to be produced in a linearly 
independent manner from the original pattem components. Although no new 
information is explicitly inserted into the process, the input representation has 
been enhanced and the linear separability can be achieved in the extended space. 
The discussion below covers two methods of extending the dimensionality of the 
input space for functional link network design. 

Assume that the original data are represented by /-tuple vectors. In the 
so-called tensor model, suitable for handling input pattems in the form of vec- 
tors, the additional input terms are obtained for each /-dimensional input pattem 
as the products x t Xj for all 1 < i and j < / such that / < y < J (case A). A 
number of product terms generated is shown in Figure 4.28. Altematively, the 
products can be computed as in case A and augmented with x t XjX k terms for all 
1 < i, j, fc < / such that i < j < k < J (case B). This discussion shows that 
the number of additional inputs required for the functional link method grows 
very quickly. The increment H in the size of input vector is also given in Figure 
4.28. The figure also lists several first new product terms generated in each of 
the cases A and B. The folio wing example revisits Example 4.1 and vividly 
demonstrates the power of the functional link approach. 
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Figure 4.28 Increase in input vector size and example additional terms for vector input patterns. 


EXAMPLE 4.6 


Let us discuss a simple solution of the two-dimensional XOR problem with 
the functional link network. Both cases A or B yield the single product x x x 2 
as the third input. We thus have H — 1. The new, extended input Space 
representation using coordinates {x x , x 2 , *\X 2 } of extended XOR input data 
is shown in Figure 4.29(a). The training set in the new input space consists 
of fbur patterns {—1 —1 1,-1 1 — 1 , 1 —1 — 1 , 1 1 1} and the ordered 
set output of target values is {-1,1,1,-!}. 


□ Class 1 
O Class 2 




Figure 4.29 Figure for Example 4.6: (a) linear separability through enhanced input representa- 
tion and (b) network diagram. 
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Number of iterations Number of iterations 

(a) (b) 

Figure 4.30 Comparison of learning rate between the hidden layer network [curves (a)] and 
fundional link network [curves (b)]: (a) two-variable XOR problem and (b) three-variable XOR 
problem. [Soufce: Pao (1989). ©1989 Addison-Wesley; reprinted with permission.] 


We now see that the enhanced input pattems are linearly separable. An 
example equation defining the weights is 

x x x 2 - 0 (4.55) 

The horizontal plane as in (4.55) passing through the origin provides one of 
the infinite number of solutions. The related functional link network diagram 
is shown in Figure 4.29(b). 

The learning rates of functional link networks are reportedly signif- 
icantly better for two- and three-variable XOR problems than those of 
two-layer networks. Curves (b) of Figure 4.30 obtained for the functional 
link approach show the dramatic improvement of learning speed over the 
conventional back-propagation-trained two-layer networks [curves (a)]. m* 

In the so-called functional model of a functional link network, the higher 
order input terms are generated using the orthogonal basis functions. Functions 
such as sin ttx, cos irx , sin 277 *, cos 277 *, . . . , can be used to enhance the repre- 
sentation of input x. A function f(x) of a single variable can also be leamed in 
a single-layer network. Obviously, there is a question about how many of the 
terms need to be retained to achieve efficient learning of function approximation. 
Excellent results have been reported by Pao (1989) with learning of an example 
function /(*) using its 20 sample point pairs *,/(*). The terms that needed to be 
generated have been x , sin 77 *, cos 77 *, sin 277 *, cos 277 *, and sin 477 *. 
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The reader has certainly noticed that the network with a single-layer, or 
so-called “flat” neural network based on the concept of a functional link does 
not strictly belong to the class of layered networks discussed in this chapter. It 
has only one layer of neurons. However, due to its intrinsic mapping properties, 
the functional link network performs very similarly to the multilayer network. 
The linearly nonseparable pattems acquire an enhanced input representation, so 
that mapping can be performed through a single layer. The distinet advantage of 
the functional link network is apparently easier training; however, more research 
needs to be done to compare conclusively the functional link approach with the 
error back-propagation alternative. 


CONCLUDING REMARKS 

The material in this chapter describes the principal concepts of multilayer 
feedforward artificial neural networks. Appropriate input Space mapping proper- 
ties are developed starting with the simple prototype point space partitioning and 
its mapping to the linearly separable image space. The scheme, originally intro- 
duced for discrete perceptron networks and visualized in low-dimensional space, 
is then extended to multilayer neural networks using continuous perceptrons and 
formulated ih terms of network training. 

The delta rule and generalized delta rule training algorithms have been 
developed. It has been illustrated that the back-propagation algorithm solves the 
training task of an artificial layered neural network to perform potentially arbi- 
trary input-output mappings defined by training examples. The algorithm uses 
the least mean square error minimization strategy. The gradient of error resulting 
for the current value of the input pattem is computed. A step in the weight 
space is then taken along the negative gradient to reduce the current error value. 
The advantage of the method is that each adjustment step is computed quickly, 
without presentation of all the pattems and without finding an overall direction 
of the descent for the training cycle. The discussion of leaming parameters has 
been provided to enhance the understanding of error back-propagation leaming 
performance. Due to the randomness of the minimum search during training, the 
error back-propagation technique is based on stochastic approximation theory. As 
such, it provides accurate input-output mapping in a statistical sense. 

The basic concepts underlying the idea of classification applications and of 
intelligent layered neural networks have also been outlined in this chapter. We 
have seen how back-propagation-trained networks can perform like rule-based 
expert systems. Application examples of diagnostic layered networks show the 
feasibility of the neural network approach for the selected practical tasks of 
classification of characters and engine fault diagnosis. An alternative approach to 
the generalized delta rule using layered network training has also been presented. 
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The functional link method is able to implement arbitrary input-output mappings 
due to the proper expansion of input data representation. It merely requires simple 
delta rule training since it employs a single-layer network. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P4.1 The linearly nonseparable pattems x 1? . .., x 10 listed below have to be 
elassified in two categories using a layered network and an appropriate 
pattem-image space transformation. Design a two-layer classifier with the 
bipolar discrete perceptrons based on the appropriate space mapping. 
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x 4 ,x 6 ,x 7 ,x 8 ,x 9 : class 1; remaining pattems: class 2. 


P4.2 Linearly nonseparable pattems as shown in Figure P4.2 have to be clas- 
sified in two categories using a layered network. Construct the separating 
planes in the pattem space and draw pattems in the image space. Calculate 
all weights and threshold values of related TLU units. Use the minimum 
number of threshold units to perform the classification. 



Figure P4,2 Patterns for layered network classification for Problem P4.2. 
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P4.3 The layered bipolar perceptron network as shown in Figure P4.3 imple- 
ments a concave partition of space x x ,x 2 . The values inside TLU elements 
are their respective thresholds T. Find the subset 9? that results in the 
response o = 1 and draw it on the x l9 x 2 plane. 

P4.4 Two planar input pattem regions of class 1 and 2 are shown in Figure 
P4.4. Note that since one decision region is concave, the dichotomization 
shown is linearly nonseparable. Using the layered bipolar discrete per- 
ceptron network with TLU elements, design the dichotomizer to separate 
pattem classes as shown. Solve the problem for Figure P4.4(a) using two 
hidden layers and for (b) using a single hidden layer by appropriate input 
to image space mapping. 
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Figure P4.4 Planar images for classification using discrete perceptrons for Problem P4.4. 


P4.5 


P4.6 


Note that the TLU networks producing a unit step function defined as 


/ 0 (w'x) = { 


1 for w'x > T 
0 for w'x < T 


can implement Boolean functions. Find the diagrams of networks consist- 
ing of discrete perceptrons and appropriate weights. The networks should 
realize the following functions (a prime denotes the logic complement 
operation): 

(a) F(x x ,x 2 ,x 3 ) = x x x 2 x 3 

(b) F(x x ,x 2 ,x 3 ) = x x + x 2 + x 3 

(c) F(x i,x 2 ,x 3 ) = x x (x 2 + x 3 ) 


(d) F(x l9 x 2 ,x 3 ) - x\x 2 + x\x 2 x 3 + x x (x 2 x 3 + x 2 x 3 ) 

knowing that x x , x 2 , and x 3 are Boolean variables 0 and 1, and that 
the positive logic is used. 


A layered TLU network employing a single hidden unipolar TLU as de- 
fined in Problem P4.5 and a single output unipolar TLU can be used to 
implement the XOR function defined in Example 4.1. Partial design of 
the network is shown in Figure P4.6. Specify all missing weights of the 
network including the threshold value of the output unit. Note that the 
image and input pattem data are combined in this case in an attempt to 
achieve proper response o 4 . 
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Figure P4.6 Dichotomizer using two unipolar TLU elements for solution of the XOR problem 
from Problem P4.6. 


P4.7 The network shown in the Figure P4.7 and employing unipolar TLU ele- 
ments similar to those used in Problem P4.5 has been trained to classify 
all eight three-bit pattem vectors that are vertices of a three-dimensional 
cube. The training set is {o j = {0 0 0, , 1 1 1}. Analyze 

the network and find the function it implements in terms of the inputs o l9 
o 2 , and o 3 . 

P4.8 You are presented with the prototypes in augmented form: 



A layered machine with two discrete bipolar perceptrons in the hidden 
layer and a single discrete bipolar output perceptron needs to classify 
the prototypes so that only x l9 x 2 , and x 3 belong to class 1, with the 
remaining prototypes belonging to class 2. 

(a) Check whether weight vectors 



would provide the linear separation of pattems as required. 
(b) Repeat part (a) for the new weight vectors: 
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(Unipolar) 


Figure P4.7 Layered discrete perceptron network for Problem P4.7. (All unlabeled weights are 
of unity value.) 


(c) Complete the design of the classifier by using the results from either 
part (a) or (b), and compute the weights of the single perceptron at 
the output. 

P4.9 Prove that for n = 2, the number of hidden layer neurons J needed for 
hyperplane partition into Af regions is 

/= i (V8 M - 7 - l) 

P4.10 Assume that a two-class classification problem of a planar pattem ( n = 2) 
is solved using a neural network architecture with / = 8 and K = 2. 
Determine the lower bound on P, which is the number of vectors in 
the planar training set. This number is equal to the number of separable 
regions Af. 

P4.ll Assume that a two-class classification problem for n = 60 needs to be 
solved using J = 7 and K = 2. Determine the lower bound on P, which 
is the number of vectors in the training set. Assume that P is equal to the 
number of separable regions Af. 

P4.12 Planar input pattems of four classes are shown in Figure P4.12. Using the 
layered network of bipolar discrete perceptrons, design the classifier for 
the linearly nonseparable classes shown. Use three perceptrons in the out- 
put layer; when none of the three perceptrons responds 4-1, this indicates 
class 4. 




240 


MULTILAYER FEEDFORWARD NETWORKS 



Figure P4.12 Planar images for classification using discrete perceptrons for Problem P4.12. 


/M.73*Implement the delta training rule algorithm for a four continuous bipolar 
perceptron network and a four-dimensional augmented pattem vector. The 
network trained is supposed to provide classification into four categories 
of P pattems. Perform the training of the network for pattem data as 
in Problem P3.1. (The number of training pattems is P = 8.) Use local 
representation, assume A = 1 for training. 

P4.14 The network shown in Figure P4.14, when properly trained, should re- 
spond with 


0\ 

.° 2 _ 


to the augmented input pattem 
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The network weights have been initialized as shown in the figure. Analyze 
a single feedforward and back-propagation step for the initialized network 
by doing the folio wing: 

(a) Find weight matrices V and W. 

(b) Calculate net ; , y, net*, and o. 
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(c) Calculate slopes f '(netj) and f'(net k ). 

(d) Compute error signals 8 0 and 8 y . 

(e) Compute AV and AW. 

(f) Find updated weights. 

For computations assum t f (net) = [1 + exp(-net)]~~ l and 17 = 1. 

P4.75*Write a program implementing the error back-propagation training algo- 
rithm (EBPTA) for user-selectable 7, /, and K values for a single hidden 
layer network. The flowchart of the algorithm is outlined in Section 4.4. 
Leaming constant 17 should be user-selectable; no momentum term is 
needed. The initial weights for the network should be selected at random. 
Provisions for specification of input pattem(s) and the desired response(s) 
should be made in order to initiate and carry out the training. Use bipolar 
continuous perceptrons. 

P4. 1 6* (This problem requires the use of a back-propagation training program 
written by the student in Problem P4.15, or available from other sources.) 

Implement the classifier of three printed characters A, I, and O as 
shown in Figure P4.16. Set an appropriate value such that an error- 
free classification is assured. Assume no momentum term; try different tj 
values. Evaluate the number of training cycles for comparable 17 values 
for two different architectures. The target values should be selécted as 
(1 -1 — 1 ) for A, ( — 1 1 -1) for I, and (-1 -1 1 ) for O. 
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Assume two different input representations according to the following 
guidelines: 

(a) 16 pixels, three input vectors in the training set, n = 16 (/ = 17, 
/ = 9, and K = 3) 

(b) two coordinates of each pixel, 16 input vectors in the training set, 
n = 2 (/ = 3, J = 9, and K = 3). 

P4.17 The network shown in Figure P4.17 has been trained to classify correctly 
a number of two-dimensional, two-class inputs. 

(a) Draw the separating lines between the two classes on the x x ,x 2 plane 
assuming that the neurons operate with the discrete bipolar activation 
function. 



Figure P4.17 Trained neural network for Problem P4.17. 
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Figure P4.18 Trained neural network for Problem P4.18. 


(b) Assume the bipolar continuous activation function of neurons. Now, 
the thresholding above and below certain levels of output can be 
used to assert the membership in classes. Generate equations for the 
region of uncertainty in classification on both sides of the borderline 
analyzed in part (a) if the following criterion for indecision is used: 
o > 0.9, class 1; o < —0.9, class 2. Perform computations assuming 
for simplicity a = b ~ 0. 

P4.18 The network shown in Figure P4.18 has been trained to classify cor- 
rectly a set of two-dimensional, two-class pattems. 

(a) Identify the function performed by the classifier, assuming ini- 
tially that the neurons have unipolar discrete activation function. 
Draw the resulting separating lines between the two classes on 
the x x ,x 2 plane. 

(b) Generate 36 points of the test inputs within the [0, 1] square 
on the x x ,x 2 plane by incrementing each of the coordinates by 
0.2. By performing recall for each of the test pattems, find the 
responses for the network with continuous perceptrons assuming 
the unipolar activation function o = [1 + exp(— net)]~ l . 
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Figure P4.19 Training pattern set for neural network design in Problem P4.19. 


P4.19 Design a feedforward network for the example training set containing 
linearly nonseparable pattems of three classes shown in Figure P4.19. 

(a) Select a suitable number of neurons in the hidden layer of the 
network using n = 2. Use K = 3 so that the target vectors are 
[1-1-1]', [-1 1 -1]', and [-1 -1 1]' for classes 1, 2, 

and 3, respectively. 

(bf Train the network as selected in part (a) by using the nine vectors 
of the set shown. Select the E ms value that assures an error-free 
classification. Assume no momentum term, try different tj values. 
(This part of the problem requires the use of a back-propagation 
training program written by the student in Problem P4.15 or 
available from other sources.) 

P4.20 The network consisting of unipolar discrete perceptrons responding 
0 for negative input, or otherwise 1, and shown in Figure P4.20 
has been designed to detect one of the aircraft fault modes (Passino, 
Sartori, and Antsaklis 1989). Quantities 0 (deg) and q (deg/s) denote 
the aircraft’s pitch angle and pitch rate, respectively. 

(a) Find the decision regions in coordinates &,q indicating the spe- 
cific failure by analyzing the network diagram. 

(b) Simplify the network architecture, if possible, by reducing the 
number of hidden nodes J. 
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Figure P4.20 Failure detector discrete perceptron network for Problem P4.20. 



246 


MULTILAYER FEEDFORWARD NETWORKS 



Figure P4.22 Training set for Problem P4.22. 


P4.21 A functional link network based on the tensor model has been trained 
with the following extended input vectors 
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of class 2. 


(a) Determine whether the training pattems in the original space are 
linearly separable. 


(b) Deri ve the set of weights of functional link network using the 
single TLU that would satisfy the required classification. 

P4.22 Design a functional link classifier using the tensor model for a set 
of training pattems as shown in Figure P4.22. Use the delta leaming 
rule for training of the single-layer network. 

P4.23 Figure P4.23(a) shows the network that approximates a continuous 
function of a single variable t. The network uses bipolar continu- 
ous perceptrons, has 10 hidden layer units, and a single output unit. 
Knowing that the network weights are as listed in Figure P4.23(b), 




PROBLEMS 


247 



(a) 

[1.12 2.46 6.11 -1.08 0.96 -1.03 -0.58 -1.11 1.13 1.05" 

[0.36 0.27 0.09 0.28 0.24 -0.29 0.12 -0.34 0.05 0.06 

W = [ -1.35 0.14 4.26 1.18 -1.02 1.20 0.55 1.33 -1.27 -1.20 0.45] 

(b) 

Figure P4.23 NetWork for function approximation: (a) network diagram and (b) weight matrices. 


find numerically the function implemented by the network by per- 
forming recall of selected test pattems in the range — 1 ^ t ^ 1 . 

P4.24* Figure P4.24 illustrates an accurate function h(x) (dashed line): 

h(x) = 0.8 sin wc, for — 1 s x < 1 

and its neural network approximation o(x) (continuous line). It can 
be seen that the training conditions have been P = 11, 17 = 0.4, and 
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Accurate function h(x) 
NetWork function o{x) 

Test points for approximation 



X 

Figure P4.24 Approximating h(x) with the neural network of Problem P4.24. 


E xms = 0.02. It can also be seen that 21 test points uniformly covering 
the range [-1,1] have been used. 

Using the simulation conditions for training and testing the net- 
work as illustrated on the figure, find a neural network that performs 
the required approximation. Use single hidden-layer network and 
bipolar continuous perceptrons. 
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SINGLE-LAYER 
FEEDBACK NETWORKS 

Who errs and mends, to 
God himself commends. 

CERVANTES 

5.1 Basic Concepts of Dynamical Systems 

5.2 Mathematical Foundations of Discrete-Time Hopfield Networks 

5.3 Mathematical Foundations of Gradient-Type Hopfield Networks 

5.4 Transient Response of Continuous-Time Networks 

5.5 Relaxation Modeling in Single-Layer Feedback Networks 

5.6 Example Solutions of Optimization Problems 

5.7 Concluding Remarks 


T he neural networks introduced in this chapter represent dynamical systems 
evolving in time in either a continuous, or discrete, output space. The 
movement of the network is usually characterized by many degrees of freedom. 
It is also dissipative in the sense that the movement is in the direction of lower 
so-called computational energies exhibited by the system. As will be shown, tran- 
sition in a dynamical neural network is toward an asymptotically stable solution 
that is a local minimum of a dissipated energy function. 

The feedforward networks presented in Chapters 3 and 4 have no feed- 
back activity during the information recall phase. Let us look in more detail 
at the role and presence of feedback in feedforward networks. As discussed in 
earlier chapters, the feedback interactions in single- and multilayer perceptron- 
type networks occurred during their leaming, or training, phase. The adjustment 
of weights during training gradually reduced the overall output error. The er- 
ror reduction embedded in one of the supervised training algorithms provided 
that feedback-type interaction. Noticeably, that interaction was teacher-enforced 
rather than spontaneous, and extemally imposed rather than occurring within the 
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system. Feedback interactions within the network ceased once the training had 
been completed. 

The single-layer neural networks discussed in this chapter are inherently 
feedback-type nonlinear networks. Neurons with either a hard-limiting acti vation 
funetion or with a continuous activation function can be used in such systems. As 
discussed in Chapter 2, recall of data from within the dynamical system requires 
time. To recall information stored in the network, an input pattem is applied, 
and the network’ s output is initialized accordingly. Next, the initializing pattem 
is removed and the initial output forces the new, updated input through feed- 
back connections. The first updated input forces the first updated output. This, 
in tum, produces the second updated input and the second updated response. 
The process of sequential updating continues, and the activity of outputs for 
a properly designed, or stable, network becomes less frequent. The transition 
process continues until no new updated responses are produced and the network 
has reached its equilibrium. 

In the case of a discrete-time operation , also called recursive, single-layer 
feedback networks can be termed as recurrent. A recurrent network is a discrete- 
time dynamical system, which, at any given instant of time, is characterized 
by a binary output vector. Examples of such networks are illustrated in Fig- 
ures 2.10, 2.11, and 2.12. The recurrent networks’ sequential updating process 
described in the preceding paragraph can be considered either discrete syn- 
chronous or discrete asynchronous in time. A more extensive analysis of recurrent 
single-layer feedback networks with discrete neurons and an explanation of their 
update dynamics will be pursued in Chapter 6 in the framework of associative 
memory architectures. 

The discussion of single-layer feedback networks in this chapter focuses on 
networks operating in continuous time and with continuous output vector values. 
Accordingly, continuous activation functions are assumed for such networks and 
the updating is simply continuous in time at every network output. Noticeably, 
network outputs are also neuron outputs since the network is with single layer. 

Examples of continuous-time single-layer feedback networks are shown in 
Figures 2.14 and 2.15. In this chapter we will look at the relevant facets, such 
as dynamics, design, and updating schemes, of single-layer neural systems that 
store knowledge in their stable States. These networks fulfill certain assumptions 
that make the class of networks stable and useful, and their behavior predictable 
in most cases. 

Remember, however, that fully coupled single-layer neural networks repre- 
sent nonlinear feedback systems. Such systems are known, in general, to possess 
rather complex dynamics. Fortunately, single-layer feedback networks represent 
a class of networks that allows for great reduction of the complexity. As a result, 
their properties can be controlled and solutions utilized by neural network design- 
ers. This presents possibilities for solving optimization problems and applying 
such networks to modeling of technological and economical systems. 
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The networks discussed here are based on the seminal papers of Hop- 
field (1984, 1985, 1986). However, many years of development in the areas 
of continuous- and discrete-time systems have contributed to the existing state- 
of-the-art single-layer feedback networks. As we will see, the networks can be 
useful in many ways. They can provide associations or classifications, optimiza- 
tion problem solution, restoration of pattems, and, in general, as with perceptron 
networks, they can be viewed as mapping networks. Despite some unsolved prob- 
lems and limitations of the fully coupled single-layer networks, their impressive 
performance has been documented in the technical literature. Both hardware 
implementations and their numerieal simulations indicate that single-layer feed- 
back networks provide a useful alternative to traditional approaches for pattem 
recognition, association, and optimization problems. 


5.1 

BASIC CONCEPTS OF 
DYNAMICAL SYSTEMS 

The neural networks covered in this chapter are dynamical systems. As such, 
they process the initial condition information over time while moving through a 
sequence of States. Let us look at a classical problem of recognition of a distorted 
alphanumerical character. Characters shown in bit map form can be represented 
as vectors consisting of the binary variables 0 and 1. Specific example letters to 
be recognized are presented in Figure 3.2(a), Figure 4.19, and in Example 4.5, 
Figure 4.23(a). Taking a different perspective than in Chapter 4, we will now 
design a dynamical system so that its States of equilibrium correspond to the set 
of selected character vectors. 

Let us consider several basic concepts before we examine their definitions 
more closely in Chapter 6. An attractor is a State toward which the system 
evolves in time starting from certain initial conditions. Each attractor has its set 
of initial conditions, which initiates the evolution terminating in that attractor. 
This set of conditions for an attractor is called the basin of attraction. If an 
attractor is a unique point in State space, then it is called a fixed point. However, 
an attractor may consist of a periodic sequence of States, in which case it is 
called the limit cycle, or it may have a more complicated structure. 

Using the concept and properties of single-layer feedback networks, a dy- 
namical system can be postulated and designed so that it has three vectors 
representing the letters C, I, and T from Figure 4.19 as fixed points. Moreover, 
the system can have its basins of attractions shaped in such a way that it will 
evolve toward an appropriate character vector even if the initializing vector does 
not exactly resemble the original character. Thus, a certain amount of noise or 
distortion of the character vector not only can be tolerated by the system, but even 
removed under the appropriate conditions. In our approach we are more interested 
in the conclusions that systems can reach than in the transient evolutions from 
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any initial conditions that are provided to initialize the transients. However, to 
enhance the understanding of dynamical network performance, a study of both 
aspects is necessary. 

Let us note that the information about the stationary point of a dynamical 
system must be coded in its intemal parameters and not in its initial condi- 
tions. A nontrivial recognition or association task will obviously require a large 
amount of information to be stored in system parameters, which are its weights. 
Therefore, dynamical neural networks need to have their numerous parameters 
selected carefully. Accordingly, their State sequences will evolve toward suitable 
attractors in multidimensional space. In addition, the system will have to be 
inherently nonlinear to be useful. 

The leaming of parameters of a dynamical system is dependent on the net- 
work model adopted. A variety of leaming methods is applied to store information 
in single-layer feedback networks. Although correlation or Hebbian leaming is 
primarily used, many applications would require a highly customized leaming 
approach. Typically, the leaming of dynamical systems is accomplished without 
a teacher. The adjustment of system parameters, which are functions of pattems 
or associations to be leamed, does not depend on the difference between the 
desired and actual output value of the system during the leaming phase. 

One of the inherent drawbacks of dynamical systems is their very limited 
explanation capability. The solutions offered by the networks are hard to track 
back or to explain and are often due to random factors. This can make it difficult 
to accept the conclusion, but it does not reduce the networks’ levels of perfor- 
mance. However, it should be stressed that the dynamical systems approach to 
cognitive tasks is still in an early development stage. The application of the 
dynamical models to real-size optimization or association problems beyond their 
present scope will require a great deal of further scientific development. 


5.2 — — — i 

MATHEMATICAL 
FOUNDATIONS OF 
DISCRETE-TIME 
HOPFIELD NETWORKS 

The attention given currently to this class of networks and to their mathe- 
matical model is due to their very interesting intrinsic properties. Folio wing the 
development of the theory of this class of networks in the early and mid-1980s, 
modem microelectronic and optoelectronic technology has made it possible to 
fabricate microsystems based on the formulated network model (Howard, Jackel, 
and Graf 1988; Alspector et al. 1988). This section reviews the main properties of 
the continuous-time dynamical system model. A discussion of feedback network 
implementation issues is provided in Chapter 9. 
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Figure 5.1 Single-layer feedback neural network. 


Following the postulates of Hopfield, the single-layer feedback neural net- 
work is assumed as shown in Figure 5.1. It consists of n neurons having threshold 
values Tj. The feedback input to the fth neuron is equal to the weighted sum 
of neuron outputs Vj, where j = 1,2,...,«. Denoting w tj as the weight value 
connecting the output of the /th neuron with the input of the Fth neuron, we 
can express the total input netj of the fth neuron as 

net i = X WijVi + ij ~ T h for i = 1, 2, . . . , n (5.1a) 

y=i 
J*i 

Note that this notation is consistent with the double subscript notation used in 
Chapters 2, 3, and 4. The extemal input to the Fth neuron has been denoted here 
as Introducing the vector notation for synaptic weights and neuron output, 
Equation (5.1a) can be rewritten as 

netj = w'v + ij - 7), for i = 1,2,...,« (5.1b) 

where 

w,,' 

w a 


A 

W ; . = 
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is the weight vector containing weights connected to the input of the z’th neuron, 
and 


A 

v = 


Vl 

V 2 



is the neural network output vector, also called the output vector. The complete 
matrix description of the linear portion of the system shown in Figure 5.1 is 
given by 


net = Wv + i - t 


(5.2) 


where 


" net x " 


'h' 

net 2 

. A 

h 

: 

, 1 = 


.net„_ 


Jn. 


. A 

net = 


are vectors containing activations and extemal inputs to each neuron, respectively. 
The threshold vector t has been defined here as 


t = 


T 

L 1 n J 


Matrix W, sometimes called the connectivity matrix , is an n X n matrix containing 
network weights arranged in rows of vectors w' as defined in (5.1b) and it is 
equal to 


W = 


w 

w! 




In the expanded form this matrix becomes 



' 0 

w 12 

w 13 

• • ■ W ln 


w 21 

0 

^23 

’ • ’ W 2 n 

w = 

W31 

W32 

0 

■■■ W 3 „ 


i 

W n 2 

3 

... 0 


Note that we assume the weight matrix W in this model that is symmetrical, 
i.e., w t j = Wj i9 and with diagonal entries equal explicitly to zero, i.e., w H — 0. (See 
Figure 5.1; no connection exists from any neuron back to itself). Physically, this 
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condition is equivalent to the lack of the self-feedback in the nonlinear dynamical 
system of Figure 5.1. Summarizing, the /th neuron output is connected to each 
of the neurons’ inputs through a multiplicative sy naptic weight w ip for i = 1, 2, 
. . . , n , but it is not connected to its own input. 

Let us assume momentarily that the neuron’s activation function is sgn(-), 
like this of a TLU element. This causes the following response, or update, of 
the fth neuron excited as in (5.1) or (5.2): 

v f 1 if < 0 

V; — ► + 1 if net t > 0 

Transitions indicated by right arrows based on the update rule of (5.3) are taking 
place at certain times. If the total input to a particular neuron gathered additively 
as a weighted sum of outputs plus the extemal input applied is below the neuron’s 
threshold, the neuron will have to move to, or remain in, the inhibited State. The 
net value as in (5.1a) exceeding zero would result in the excitatory State -hl 
immediately after the update. Let us note that the resulting State + 1 of the fth 
neuron is either preceded by the transition or not. 

The rule (5.3) of interrogating the neurons’ weighted outputs and updating 
the output of the fth neuron is applied in an asynchronous fashion. This means 
that for a given time, only a single neuron is allowed to update its output, and 
only one entry in vector v as in (5.1b) is allowed to change. The next update in 
a series uses the already updated vector v. In other words, under asynchronous 
operation of the network, each element of the output vector is updated sepa- 
rately, while taking into account the most recent values for the elements that 
have already been updated and remain stable. This mode of update realistically 
models random propagation delays and random factors such as noise and jitter. 
Such phenomena would indeed be likely to occur in an artificial neural network 
using high-gain neurons described with an activation function close to sgn {net). 

Formalizing the update algorithm (5.3) for a discrete-time recurrent network 
and using (5.1), we can obtain the following update rule: 

vf +1 = sgn (wjv* + - T z ) , for i = 1, 2, . . . , n and k = 0, 1, . . . (5.4a) 

where superscript k denotes the index of recursive update. 

The update scheme in (5.4a) is understood to be asynchronous, thus taking 
place only for one value of i at a time. Note that the right arrow in (5.3) 
has been replaced by the delay between the right and left sides of (5.4a). The 
update superscript of v* +1 refers here to the discrete-time instant, and it could 
be replaced with v t [{k + 1 )f], or simply with Vj(k +1), where t denotes the 
neurons’ update interval assumed here of unity value. The recursion starts at v°, 
which is the output vector corresponding to the initial pattem submitted. The 
first iteration for k = 1 results in v? , where the neuron number, /, is random. The 
other updates are also for random node number j, resulting in updates vj , j =£ i , 
until all updated elements of the vector v 1 are obtained based on vector v°. This 
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particular update algorithm is referred to as an asynchronous stochastic recursion 
of the Hopfield model network. 

The matrix equation (5.2) can be used as an alternative to express the 
recursive update algorithm (5.3). In such a case we have 


v*+i = r [ Wv * + i - t], 


for k = 0, 1, . . . , 


(5.4b) 


where sgn(-) operates on every scalar row of the bracketed matrix. However, 
caution should now be exercised since formula (5.4b) describes the synchronous, 
or parallel, update algorithm. Under this update mode, all n neurons of the layer, 
rather than a single one, are allowed to change their output simultaneously. 
Indeed, starting at vector v°, entries of vector v 1 are concurrently computed ac- 
cording to (5.4b) based on the originally initialized v° value, then v 2 is computed 
using v 1 , etc. 

The asynchronous update scheme requires that once an updated entry of 
vector v* +1 has been computed for a particular step, this update is substituted 
for the current value and used to calculate its subsequent update. This process 
should continue until all entries of v* +1 have been updated. The recursive com- 
putation continues until the output node vector remains unchanged with further 
iterations. It thus can be said that when using formula (5.4b), each of the recursion 
steps for k = 0, 1, . . . should be divided into n individual, randomly sequenced 
single neuron updates. 

It is rather illustrative to visualize the vector of neuron outputs v in n- 
dimensional Space. The output vector is one of the vertices of the ^-dimensional 
cube [— 1, 1] in E n space. The vector moves during recursions (5.4) from vertex 
to vertex, until it stabilizes in one of the 2 n vertices available. Note that the 
movement is from a vertex to an adjacent vertex since the asynchronous update 
mode allows for a single-component update of an rc-tuple vector at a time. The 
final position of v*, as k — > oo ? is determined by weights, thresholds, inputs, and 
the initial vector v°. It is also determined by the order of transitions. 

Some of the open questions are whether the system has any attractors, and 
whether it stabilizes, which we have assumed so far without proof; and, if so, is 
there any link between the initial pattem v° and its final value v k ,k — * oo. Further, 
it is interesting to see how the equilibrium is reached and how many attractors 
can be stored, if any, in the system discussed. Finally, we will be understandably 
interested to see how networks of this type can be designed and their weights 
computed to suit specific mapping needs. 

To evaluate the stability property of the dynamical system of interest, let 
us study a so-called computational energy function. This is a function usually 
defined in Az-dimensional output space v n . The motivation for such choice of space 
is that system specifications are given most often in terms of its desired outputs 
and are usually available in output space as opposed to the neuron input space, 
which is the State space. Also, the space v n is bounded to within the [—1,1] 
hypercube, including its walls, edges, and vertices. If the increments of a certain 
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bounded positive-valued computational energy function under the algorithm (5.3) 
are found to be nonpositive, then the function can be called a Liapunov function, 
and the system would be asymptotically stable (see the Appendix). 

The scalar-valued energy function for the discussed system is a quadratic 
form (see the Appendix) and has the matrix form 


1 


v'Wv - fy + fy 


(5.5a) 


or, in the expanded form, it is equal to 


A 1 n n n n 

£ = -- I I WyViVj - S !>,- + 1 t,-V, 
£ 1=1 j—l 1=1 1=1 

i*j 


(5.5b) 


Let us study the changes of the energy function for the system which is allowed 
to update. Assume that the output node i has been updated at the A’th instant 
so that vf +1 — v* = Av,-. Since only the single neuron computes, the scheme is 
one of asynchronous updates. Let us determine the related energy increment in 
this case. Computing the energy gradient vector, we obtain from (5.5a) (see the 
Appendix) 


VE = ~-(W' + W)v 


i f + 1' 


(5.6a) 


which reduces for symmetrical matrix W for which W' = W to the form 


VE - -Wv - i' + t‘ 


(5.6b) 


The energy increment becomes equal: 


AE = (V£)'Av 

Since only the /’th output is updated, we have 

0 


Av = 


Av,- 


0 

and the energy increment (5.6c) reduces to the form 

AE = (-wjv - i\ + ø Av i 

This can be rewritten as: 

AE = - f I W'jVj + i, - Av,-, for j # i 


(5.6c) 


(5.6d) 


(5.6e) 


or briefly 
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Inspecting the update rule (5.3), we see that the expression in parentheses in 
(5.6d) and (5.6e), which is equal to net r and the current update value of the 
output node, Av,, relate as follows: 


(a) when net- t < 0, then Av, ^ 0, and 

(5.7) 

(b) when net ( > 0, then Av, > 0 

The observations stated here mean that under the update algorithm discussed, 
the product term ner, Av, is always nonnegative. Thus, any corresponding energy 
changes AE in (5.6d) or (5.6e) are nonpositive. We therefore can conclude that 
the neural network undergoing transitions will either decrease or retain its energy 
E as a result of each individual update. 

We have shown that the nonincreasing property of the energy function E 
is valid only when w t j — w ;7 . Otherwise, the proof of nonpositive energy incre- 
ments does not hold entirely. Indeed, if no symmetry of weights is imposed, the 
corresponding energy increments under the algorithm (5.3) take on the following 
value: 


AE - — 


n 1 

I 2 (w ‘J + w j^ v J + l ' ~ 


Av,- 


(5.8a) 


and the term in brackets of (5.8a) is different from net v Therefore, the energy 
increment [(5.6d) and (5.6e)] becomes nonpositive under rule (5.3) if w,y = w y7 
without any further conditions on the network. The asymmetry of the weight 
matrix W, however, may lead to the modified neural networks, which are also 
stable (Roska 1988). Since the performance of the modified and original sym- 
metric networks remains similar and it is much easier to design a symmetric 
network, we will focus further consideration on the symmetric connection model. 
The additional condition w u = 0 postulated for the discussed network will 
be justified later in this chapter when the performance of the continuous-time 
model performance network is discussed. For the asynehronous update scheme, 
one of the network’s stable States which acts as an attractor is described by 
the solution 


v* +1 = lim sgn(Wv* + i - t) (5.8b) 

We have found so far that the network computing rule (5.3) results in a 
nonincreasing energy function. However, we must now show that the energy 
function indeed has a minimum, otherwise the minimization of the energy func- 
tion would not be of much use. Let us note that since the weight matrix W is 
indefinite because of its zero diagonal, then the energy function E has neither a 
minimum nor maximum in unconstrained output space. This is one of the prop- 
erties of quadratic forms such as (5.5). However, the function E is obviously 
bounded in ^-dimensional space consisting of the 2 n vertices of ^-dimensional 
cube. Thus, the energy function has to reach its minimum finally under the update 
algorithm (5.3). 
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Figure 5.2 Example of recursive asynchronous update of corrupted digit 4: (a) k = 0, 
(b) k = 1 , (c) k = 2, (d) k = 3, (e) k = 4, and (f) k = 5. 


Example snapshots of neural network transitions were given in Figure 1.8. 
Another example of convergence of a 10 X 12 neuron network is illustrated in 
Figure 5.2. It shows the 10 X 12 bit map of black and white pixels representing 
the digit 4 during the update as defined in (5.3). Figure 5.2(a) has been made for 
k — 0, and it shows the initial, distorted digit 4 with 20% of the pixels randomly 
reversed. Consecutive responses are shown in Figure 5.2(b) through (f). It can 
be seen that the updates continue until k — 4 as in Figure 5.2(e), and for k > 5 
no changes are produced at the network output since the system arrived into one 
of its stable States (5.8b). 

It has been shown in the literature that the synchronous State updating al- 
gorithm can lead to persisting cyclic States that consist of two complementary 
pattems (Kamp and Hasler 1990). Consider the 2 X 2 weight matrix with zero 
diagonal and off-diagonal entries of -1, and the synchronous updates of the 
output vector v° = [ — 1 - 1 ] \ By processing the signal v° once we obtain 


0 -f 

-r 


sgn(l) 


T 

-1 0 

-i 


sgn(l) 


l 
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In the following update of v 1 we obtain 

-f 

-1 

Thus we obtain the same vector as v°, and the synchronous update has produced 
a cycle of two States rather than a single equilibrium State. The cycle consists of 
two complementary States. Also, we will see later that complementary pattems 
correspond to identical energy levels. The following example of a discrete-time 
neural network illustrates how the energy function and the update scheme (5.3) 
are related while network is undergoing asynchronous updating. 


0 -1 

1' 


sgn(-l) 

-1 0 

1 


sgn(-l) 


EXAMPLE 5.1 


Let us look at the energy distribution and output updating process of the 
fully coupled single-layer network from Example 2.2. Recall that its weight 
matrix is 

0 

w = 1 ( 

1 

-1 - 

For the threshold and extemal inputs 
(5.5) becomes 

£(v)=-I[v x v 2 v 3 v 4 ] 

and after rearrangements we obtain 

E(\) = -VjOa + v 3 - v 4 ) - v 2 (v 3 - v 4 ) + v 3 v 4 (5.9c) 

The system produces discrete energy levels of value —6, 0, and 2 for bipolar 
binary neurons used in this example. The reader can verify this statement 
by analyzing the energy le vels (5.9c) for each binary vector starting at 
[—1 —1 —1 — 1 ]* and ending at [ 1 1 1 1 ] f . Figure 5.3 shows the 

energy levels computed for each of the 2 4 binary vectors, which are vertices 
of the four-dimensional cube that the four-neuron network can represent. 

Each edge of the State diagram depicts the single asynchronous State 
transition. As can be seen from the figure, which shows energy values 
marked at each cube vertex, the transitions under rule (5.3) indeed displace 
the State vector toward a lower energy value. Since the energy value de- 
creases at each transition, the network seeks the energy minimum of —6, 


(5.9a) 


1 1 -1 

) 1 -1 

1 0 -1 

1 -1 0 

assumed zero, the energy function 


0 11-1 
10 1-1 
110-1 
-1 -1 -1 0 



vi 


v 2 


V 3 


_ v 4 _ 


(5.9b) 
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[-l-l-l-l]' 

( 0 ) 



which is either at o t = [ 1 1 1 - 1 ] ' or at o 2 = [ - 1 - 1 - 1 1 ] 

Therefore, the example network always stabilizes at energy level E = — 6 
provided transitions are asynchronous. 

Note that synchronous transitions, which were originally considered in 
Example 2.2, neither minimize the energy function nor are they stable. The 
reader may easily verify that starting at v° = [ 1 - 1 1 1 ] the first 

synchronous transition results in 


-11 



-1 


and then the network starts oscillating between v° and v 1 since 

r n 



i 
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Thus, the energy stabilizes at the nonminimum level of 2 associated with 
each of the two complementary States. 

In contrast to the observed phenomena, asynchronous transitions al- 
ways lead to one of the energy minima. Moreover, under the asynchronous 
and random updating algorithm, the convergence to a minimum of E is 
essentially unique and nonrepetitive. Specifically, the sequence of updat- 
ing bits determines to which of the equilibrium State minima the network 
converges. Indeed, let us look at the movement of the network starting at 
v°= [-1 1 1 1 ]\ Inspecting the transition map of Figure 5.3 we see 

that if the updates begin with the first bit, then the transitions have to end 
up at Oj. If, however, the second bit of v° is allowed to update first, the 
transitions end up at o 2 . wm 


5.3 — — — — 

MATHEMATICAL 
FOUNDATIONS OF 
GRADIENT-TYPE 
HOPFIELD NETWORKS 

Gradient-type neural networks are generalized Hopfield networks in which 
the computational energy decreases continuously in time. Time is assumed to 
be a continuous variable in gradient-type networks. Such networks represent a 
generalization of the discrete-time networks introduced in the preceding section. 
As discussed in Chapter 2, the discrete-time networks introduced in the previous 
section can be viewed as the limit case of continuous-time networks. For a 
very high gain A of the neurons, continuous-time networks perform similarly to 
discrete-time networks. In the limit case, A — ► oo, they will perform identically. 
The continuous-time single-layer feedback networks , also called gradient-type 
networks , are discussed in this section. 

Specifically, gradient-type networks converge to one of the stable minima 
in the State Space. The evolution of the system is in the general direction of the 
negative gradient of an energy function. Typically, the network energy function 
is made equivalent to a certain objective (penalty) function that needs to be min- 
imized. The search for an energy minimum performed by gradient-type networks 
corresponds to the search for a solution of an optimization problem. Gradient- 
type neural* networks are examples of nonlinear, dynamical, and asymptotically 
stable systems. The concept, fundamentals, and examples of such networks will 
be discussed below. In the discussion of their properties, we will again use the 
scalar energy function E[\(t)] in the n-dimensional output space v n . 

Although single-layer feedback networks can be completely described by a 
set of ordinary nonlinear differential equations with constant coefficients, mod- 
eling of such a set of equations by a physical system provides us with better 
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Figure 5.4a The neural network model using electrical components: (a) diagram. 


insight into such networks’ behavior. It also provides the link between the theory 
and implementation. The model of a gradient-type neural system using electrical 
components is shown in Figure 5.4(a). It consists of n neurons, each mapping its 
input voltage u t into the output voltage v, through the activation function 
which is the common static voltage transfer characteristic (VTC) of the neuron. 
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(b) 


Figure 5.4b The neural network model using electrical components (continued): (b) input node 
of the fth neuron. 


To preserve the original notation of seminal Hopfield’s papers, the variable 
u is used for this network to denote a neuron ’s activation; for other neural 
networks the symbol net is adopted in this text. Any high-gain voltage amplifier 
with saturation could be used in this model as a replacement for a neuron. 
Conductance w t j connects the output of the / th neuron to the input of the i’th 
neuron. The inverted neuron outputs v z - are usually tapped at inverting rather 
than noninverting outputs to avoid negative conductance values w t j connecting, 
in inhibitory mode, the output of the / th neuron to the input of the V th neuron. 
Note that the postulated network is required to be symmetric, i.e., w ( j = 
Also, since it has been assumed that w it = 0, the outputs of neurons are not 
connected back to their own inputs. 

Let us note that the electrical model of the Hopfield network shown in the 
figure has conductance g t between the f th neuron input and ground. It represents 
the nonzero input conductance of the f th neuron. With the conductance extracted, 
the neuron can be considered as a parasitic-free voltage amplifier absorbing no 
current. Similarly, C x represents the nonzero input capacitance of the f th neuron. 
Capacitances Q, for i = 1, 2, ...,«, are responsible for the dynamics of the 
transients in this model. 

Let us derive mathematical description of this network. The input of the f th 
neuron is shown in Figure 5.4(b). The KCL equation for the input node having 
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potential «, can be obtained as 


/; 4 - 


n 

S WijVj 
j— 1 
j*i 


W. 


\ j + Si = C i 


d«A 
d t ) 


(5.10) 


The left side of Equation (5.10) represents the total current entering the 
capacitance C ( . It is a sum of n — 1 currents of value (vj — h ( )h' (/ , for j — 1, 
2, . . . , n, j + i, the current and the current -g,«,. For the sake of clarity, 
all the branch currents involved in the summation at the left side of (5.10) have 
been marked in Figure 5.4(b). The reader can easily notice that the feedback 
current term ZjLjfy - «,-)w,y, j ¥= i, has been split in (5.10) to properly group 
terms with time variables. Since the weights w tj transform the voltage Vj - Uj 
into currents, they have dimensions of conductances and are expressed in this 
example in mhos. Denoting the total conductance connected to the neuron ’s input 
node i as G,, where 


G, -* S w ij + 8b 
M 

Equation (5.10) can be simplified to the form of a single State equation as 


. , " „ ^ /d«A 

1 I w ij v j ~ u iGi — C i ^ j 


(5.11) 


Let us now introduce the matrices C and G defined as 

C = diag[Cj, C 2 , . . . , C„] and G = diag [G x , G 2 , . . . , G n ] 


and arrange «,(?), v,(r), and i, in n-dimensional column State vector u(f), output 
vector v(t), and the bias current vector i. The final equations of the entire model 
network consisting of the State equation (5.12a) and the output equation (5.12b) 
written in matrix form can now be expressed from (5.11) for i = 1,2, . . . , n as 

du(f) 

C ~åt = WV(0 “ Gu(0 + ‘ (5 ‘ 12a) 

v(0 = f[u(0] (5.12b) 


Let us study the stability of the system described by the ordinary nonlinear dif- 
ferential equations (5.12). The stability of the equations can be evaluated using 
a generalized computational energy function £[v(/)]. The function £[v(r)] has 
not yet been determined in our discussion. We do not know the form of the 
function; however, we assume that it exists. It is obviously, as before, a scalar- 
valued function of output vector v. The time derivative of £[v(t)] can easily be 
obtained using the chain rule as 


d £[v(t)] = » My) . 

d / ,=i dVj V ‘ 


(5.13a) 
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where 



This may be written compactly as a scalar product of the vectors 

d£[v(0] - pt . , . 

— — — = V£ (v) v 
at 


where V£(v) denotes the gradient vector 


(5.13b) 


V£(v) = 


dE(y) 

dv { 

dE(\) 

dv 2 


dE(\) 

dv n 


Note that the computational energy function E[\(t ) ] is defined in ^-dimensional 
output space v n where the designed neural system specifications are usually 
known. The corresponding energy function in the State space u n of an asymp- 
totically stable system would be the system ’s Liapunov function. If its time 
derivative is found to be negative, then the energy function in the output space 
also has a negative derivative, since Equation (5.12b) describes a monotonic 
mapping of space u n into space v n . The formal verification of this property will 
be presented below. 

The suitable energy function for the system in Figure 5.4 has the following 
form: 

£(v) = - Iv'Wv - i'v + £ G, dz (5.14) 

^ 1=1 Jo 

where fr l (z) is the inverse of the activation function f { . Although the time ar- 
gument has been omitted in (5.14), for notational convenience, it is understood 
that vector v represents v(t). Let us see why and how the energy function (5.14) 
can provide us with insight into the system of Equation (5.11). 

First, the threshold term of (5.5a) has for simplicity been absorbed into the 
Fv term in (5.14). Also, the third term of the energy expression containing the 
integral of the inverse of the activation function has been introduced in (5.14) to 
account for the property of continuous activation function of neurons. We may 
now evaluate how the energy function (5.14) varies with time. Calculating the 
rate of change of energy for symmetric W using the chain rule and the property 

é, ( G 'f*‘ ,<z)dz ) = 
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yields the following result for the time derivative of the energy (see the Ap- 
pendix): 

d E 

— = (-Wv-i + Gu)' v (5.15) 

Noticeably, the expression for the gradient vector has now resulted within the 
parentheses of (5.15) before the transpose. The State equation (5.12a) can now 
be rewritten using the gradient symbol: 

— V£(v) = (5.16) 

dt 

From (5.16) we see that the negative gradient of the energy function E(\) is 
directly proportional to the speed of the State vector u, with the capacitance C i 
being the proportionality coefficient of the State vector ’s fth component. The 
appropriate changes of u i are in the general direction of the negative gradient 
component dE/d \ The velocity du/d t of the State vector is not strictly in the 
negative gradient direction, but always has a positive projection on it (see the 
Appendix). This is illustrated in Figure 5.5(a), which shows the equipotential, 
or equal energy, lines in v 2 Space of an example neural network consisting of 
two neurons. Other examples of this class of networks and its properties will be 
discussed in more detail later in this chapter. 

It is now interesting to see how the energy E of this system varies with 
time. Combining (5.13) and (5.15) yields 


dE _ _ / d uV dv 
d t \ dt ) dt 


(5.17) 


Using the inverse activation function of the neuron’s VTC, the individual 

directions of changes of the output vector v(f), or velocities, can now be computed 
as follows: 


C 'W - 


(5.18) 


Sine Qf~ v (Vj) > 0, it can be noticed that d«//d t and dvjdt have identical signs. 
But, as stated previously in (5.16), the vålues of dujdt are of the same sign as 
the negative gradient components dE/dVj . It follows thus that the changes of £, 
in time, are in the general direction toward lower values of the energy function 
in v n space. The changes in time, however, do not exaetly follow the direction 
of the gradient itself, since the term C/ -1 ^) provides here such scaling that 


dv t _ dE/dvi 
dt Cj- X \vi) 


(5.19) 


This energy minimization property of the system discussed here motivates the 
name of a gradient-type system. The property of energy minimization makes it 
possible to replace the evaluation of rather complex convergence to the minimum 
of E versus time , like it really occurs in an artificial neural system, with the 



Figure 5.5 Illustration of energy gradient descent in v 2 space: (a) descent along a negative 
gradient, (b) minimum point, and (c) saddle point. 








5.3 MATHEMATICAL F0UNDAT10NS OF GRADIENT-TYPE HOPFIELD NETWORKS 


271 


convergence to the minimum of E(\) in the output space. It can also be noticed 
that such a solution for the minimum E(\) results from solving the nonlinear, but 
purely algebraic, equations describing the nondynamic system, i.e., the system 
as shown in Figure 5.4(b) but with the capacitors removed. 

It is interesting to note that the E function also has the meaning of the so- 
called eocontent function (Kennedy and Chua 1988). Remarkably, the stationary 
point of the system’s dynamics may be examined by tinding the zero value 
of the system’s energy gradient vector V£(v). Since AE låt < 0, the energy 
function decreases in time, unless AE/dt — 0. At the network’s attractor, where 
the condition jAE/åt = 0 holds, we also have Aujåt = 0, for i = 1, 2, ..., n, 
as can be seen from (5.17). The attractor of the network must be a point so that 
any oscillations or cycles are excluded, and the network is totally stable. 

As the system comes to rest at the minimum of E(y) after it has moved 
toward a minimum in the output space, it has also reached equilibrium in the 
neuron’s input, or State, space u. This can be seen from expression (5.19). Be- 
cause the system seeks out one of its energy minima, it is instructive to see the 
relationship between the weight matrix W and the minima of the system’s energy 
function. Below, we discuss the conditions of existence of such minima. 

As derived in previous paragraphs, tinding an equilibrium point for the 
dynamic network with continuous activation function neurons corresponds to 
tinding the solution v* of the following algebraic equation 

V£(v) = 0 (5.20a) 

Speaking more precisely, an equilibrium point v* is called the stationary point. 
The stationary point can be a maximum, minimum, or saddle. The case of a 
minimum is illustrated on the contour map of the energy function of Figure 
5.5(b). Figure 5.5(c) shows the stationary point which is the saddle. Although 
the gradient vector vanishes at the saddle point, the dynamical system does not 
stabilize there. On the figure shown, the system will tend to move toward the 
upper left or bottom right comer of the map. 

Let us find the stationary points of the energy function. Using the gradient 
vector value computed from (5.14), Equation (5.20a) can be rewritten as 

-Wv - i + Gu = 0 (5.20b) 

The additional relationship (5.12b) is mapping the State space into the output 
space, u n — > v n , and along with (5.20b) they can be used for finding v*, which 
are the stationary points of the energy function. 

Solving Equation (5.20) without consideration of the constraints on the so- 
lution as belonging to the v n hypercube yields 

Wv = -i (5.21a) 

and the solution v* is obtained as 


V* = — W _1 i 


(5.21b) 
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The additional assumption has been made in (5.21) of high-gain neurons, which 
eliminates the last term in (5.20b). It can be also noted in such cases that the 
Hessian matrix (see the Appendix) of the unconstrained system using high-gain 
neurons is 

V 2 E(v) = V [— Wv — i] (5.22a) 

which yields 

V 2 E(v) = -W. (5.22b) 

By having previously chosen w a - 0 for the di§cussed neural network model, 
the Hessian matrix of the energy function equal to — W has been made neither 
positive definite nor negative definite (see the Appendix). This is resulting from 
our previous assumption that the postulated system has no self-feedback. In faet, 
symmetric matrix W with zero diagonal produces scalar-valued energy function 
of quadratic form (— V£)v*Wv — Pv, which has neither minima nor maxima since 
W is neither positive nor negative definite. Thus, the energy function £(v) pos- 
sesses no unconstrained minima. This, in tum, means that in the case of limitation 
of the output space of the system to a [—1,1] hypercube, the constrained min- 
ima of E(\) must be located somewhere at the hypercube’ s boundary, or in other 
words, on the edges or faces of the cube. If a solution v* of the constrained 
system within the cube [—1,1] exists, then it must be a saddle point. Clearly, 
this solution does not represent an attractor for the discussed dynamical system 
(Park 1989). 

Let us try to examine the locations of the stationary points of a high-gain 
neural network. The energy function (5.14) without its third term, and evaluated 
on the faces and on the edges simplifies to the following expressions, respectively, 

E = + WjjVjVj) - k t v, - kjVj (5.23a) 

E = - V; + c (5.23b) 

where k t is constant. Indices i and y, for i, j = 1, 2, . . . , n, in Equation (5.23) 
determine the face (/,/) or the edge (/), of the ^-dimensional cube. Since the 
Hessian matrix for the two-variable function E(v i9 vj) as in (5.23a) has again a 
zero diagonal, no minima can exist on any of the faces of the cube. Further, since 
(5.23b) is a monotonic and linear function of a single variable v { -, the minima 
may not occur on the edges of the cube either. Thus, the minima, if they exist, 
must be confined to the vertices of the cube. 

We have shown that the assumption of high-gain neurons resulting in ne- 
glecting the last term of energy in (5.14) enforces the constrained minima to 
be exaetly in the vertices of the [-1, 1] cube having a total of 2 n vertices. The 
high-gain network that approaches, in the limit case, the discrete-time system 
is expected to have its energy minima in the cube comers. The convergence 
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observed in actual gradient-type network hardware should result in one such 
cube comer solution. 

The gradient-type network with finite neuron gain values also evolves in 
the continuous-time mode to one of the minima of £(v). However, the minima 
are now within the cube and are attractors of the system. Such minima are 
usually desirable if they are as close to the vertices of the hypercube as possible. 
Understandably, they will always be within the [—1,1] cube, and for A — ► oo 
they will reach the cube comers. 

Concluding this discussion of stationary points of energy function for single- 
layer feedback networks we notice that the solutions of Equations (5.20) and 
(5.12b) always represent either the constrained minimum or saddle of the energy 
function. There is, in general, more than one solution to (5.20) and (5.12b). The 
solution reached by an actual neural network is dependent on network parameters 
and on the initial condition within the network. Some of the solutions produced 
by neural networks of this class are useful, so they will be desirable. Other 
solutions will be less useful and may even be erroneous, although sometimes 
hard to avoid. This property applies both to discrete-time and continuous-time 
rules of copvergence. We have considered a number of properties of single- 
layer feedback neural networks that will bring in a number of interesting results 
and applications. While these results and applications will be studied later, the 
example below will highlight some of the aspects just discussed. 


EXAMPLE 5.2 


In this example we design a simple continuous-time network using the con- 
cept of computational energy function and also evaluate stationary solutions 
of the network. The A/D converter circuit discussed below was first pro- 
posed in Tank and Hopfield (1986). 

Let us assume that the analog input value x is to be converted by a two- 
neuron network as in Figure 5.1 or 5.4(a) to its binary representation. In this 
example it will be assumed that the activation function f{u t ) is continuous 
unipolar and defined between 0 and 1. The A/D conversion error E c can 
be considered here as an energy function. We can thus express it by the 
following formula 

( 5 - 24a ) 

where the v 0 + 2v 1 is the decimal value of the binary vector [v x v 0 ] 1 and 
it corresponds to the analog x value. 

The reason for this choice of the energy function is that minimizing the 
energy will simultaneously minimize the conversion error. Our objective is 


274 


SINGLE-LAYER FEEDBACK NETWORKS 


to find matrix W and the input current vector i for a network minimizing the 
error E c . Evaluation of the error (5.24a) indicates, however, that it contains 
terms and v 2 v thus making the w u terms equal to l 2i instead of equal 
to zero as postulated by the network definition. Therefore, a supplemental 
error term E a , as in (5.24b), will be added. In addition to eliminating the 
diagonal entries of W, this term will be minimized close to the comers of 
the [0, 1] square on v 0 , v x plane. A suitable choice for E a is therefore 

E a = S 2 2 'v,(v,. - 1) (5.24b) 

Indeed, if we note that E a is nonnegative we can see that it has the lowest 
value for v z = 0 or v t = 1. Thus, combining E a and E c , chosen as in (5.24), 

results in the following total energy function, which is minimized by the 

neural network A/D converter: 

E = \ (x - i v,-2') - \ i 2 2, v,.(v,. - 1) (5.25a) 

2 V 1=0 / 2,-= 0 

This expression for energy should be equated to the general formula for 
energy as in (5.14), which is equal in this high-gain neuron case: 

E = I I W'jVjVj - £ j ( v, (5.25b) 

2 i= 0 7=0 i= 0 

}*i 

Note that the energy function (5.14) in the high-gain case retains only two 
terms, and the term with the integral vanishes because of the high-gain neu- 
rons used. Routine rearrangements of the right side of the energy expression 
(5.25a) yield 

E=\x l + \i i 2' + V, + X (2 2 '" 1 - 2'x) v, (5.26) 

2 2 1=07=0 1= 0 V 1 

When the analog input x is fixed, the term ( l A)x 2 is an additive constant in 
(5.26), and it is irrelevant for the minimization of E given by (5.26) on the 
v 0 , v 1 plane; therefore, it will be omitted. Comparing the like coefficients 
of v 0 and v x on the right side of energy functions (5.25b) and (5.26) results 
in the following network parameters: 


w 0 i = wio = ~ 2 

1 

l Q ~ x ~ 2 

= 2x — 2 


(5.27) 


The resulting two-bit A/D converter network is the special case of the 
general network from Figure 5.4(a) and is designed according to specifi- 
cations (5.27). The network diagram is shown in Figure 5.6(a). The two 
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(a) (b) 

Figure 5.6 Two-bit A/D converter from Example 5.2: (a) using noninverting neurons with 
activation function f(net) and (b) using inverting neurons with activation function -f(net) and 
voltage sources producing / 0 and i\. 


current sources produce the biasing currents i 0 and j, and the weights are of 
-0.5 ft values. Let us note that the KCL equations (5.11) for nodes i = 0 
and i = 1 can be obtained by inspection of Figure 5.6(a) as follows: 


=x ~\ + (_2)v i ~ «o(" 2 + *>) 

d«i 

C\-^ = 2X-2 + (— 2)v 0 - «i(— 2 + g,) 


(5.28) 


As we have already mentioned, the last term of the expression for energy 
(5.14) vanishes for high-gain neurons since the products u i G l become negli- 
gibly small. This assumption remains valid for large-gain A values yielding 
small activation values u h unless the solution is very far from the square’s 
vertices. Under this assumption, the converter diagram from Figure 5.6(a) 
can be redrawn as shown in Figure 5.6(b). An explicit assumption made for 
this figure is that u 0 = u x = 0, so that currents / 0 and i x as in (5.27) can be 
produced by a voltage source having voltage x combined with the reference 
voltage source, which delivers the voltage of value -IV. 

An additional change introduced in Figure 5.6(b) compared to Figure 
5.6(a) for the sake of practicality has been the replacement of neurons with 
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activation function f(net) by neurons with its negative activation function 
-f(net). Neurons with inverting properties can be called inverting neurons. 
Their outputs are now denoted v 0 and Pj. As a result of this modification, the 
previously negative conductances w 01 and w 10 of value —2 become positive, 
as desired. It is left to the reader to verify that Equations (5.28) now can 
be rewritten as follows 


dun 1 

Cq-^j- = x - - + 2vj - u 0 ( 2 + g 0 ) 

= 2x — 2 + 2v 0 — Mj( 2 + gj) 

Equation (5.29a) can be arranged in matrix form as in (5.12a): 


(5.29a) 


C 0 

0 


' dM 0 " 






~åt 


0 

2 


V 

åu x 


2 

0 


.Vl. 

.~åt . 


— 

2 + ^o 
0 


0 

2 + £i 


"°1 + 
M] J 


(5.29b) 
1 ‘ 

*~2 
2x — 2 


The condition Wy > 0 ensures the realizability of the network using resistive 
interconnections of values Ry. The conductance matrix W now has positive 
valued elements Wy = 1 / Ry. m 


TRANSIENT RESPONSE 
OF CONTINUOUS-TIME 
NETWORKS 

Equation (5.11) or (5.12) describes the discussed nonlinear dynamical sys- 
tem in the form of its State equations. Because analytical methods for solving 
such equations in closed form do not exist, suitable numerical approaches need 
to be devised for obtaining transients \(t) and the stationary solution thereof 
lim^^ \(t). 

A number of numerical integration methods allow for computing the dy- 
namic solution for v(/) starting at initial condition v(r 0 ). Also, steady-state so- 
lutions for the stationary points may be obtained by a number of methods for 
solving static nonlinear equations. In this section we will address only those 
among the many numerical approaches that are particularly useful in providing 
additional insight into network dynamics pattem. 

Let us see first how the vector field method can be used to illustrate the 
real-time phenomena in networks with finite gain neurons. The method allows 
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the transients of the nonlinear dynamical system to be captured. The vector field 
obtained by using this method involves all possible network trajectories in E n 
space. 

To obtain the vector field expression of the network of interest, Eq. (5.11) 
needs to be normalized using the Q value as a divisor. Thus, the description of 
the system is given by n equations below: 


dw, 
d t 


7T ( h ~ + X WgVj 


C, 


j= i 


for i = 1, 2, . . . , n 


(5.30) 


Our vector field analysis objective is to transform the description of the system, 
originally as in (5.30), to the following form: 


UV • 

—i = ttø[v(0L for i = 1, 2 n (5.31) 

dt 

where i//, are functions that should be found. These functions are vector field com- 
ponents. Subsequently, after discretizing differential equations [Equations (5.31)], 

values ij>\ (v), i// 2 (v), iA„(v) are numerically computed at each point of the 

space v. As a result, we obtain the vector field diagram, which clearly indicates 
complete trajectories of the system’s dynamics and shows how it evolves in 
time during transients. While the solution is somewhat more qualitative and the 
produced vectors contain no explicit time component, the vector field provides 
complete information about basins of attractions for any given initial point, as 
well as about the stationary solutions. 

To be able to use the vector field method, we note that the specific form of 
the activation function should be assumed in order to produce Equation (5.31). 
For the discussion below it has been chosen as 


v = /(«) 


1 

1 + exp(— A m) 


Simple differentiation of (5.32a) leads to 


(5.32a) 


dv 

du 


A(v - v 2 ) 


(5.32b) 


The increments of v, as functions of increments of u { can now be expressed as 


dv,- = A(v,- - v 2 )dH, 


(5.32c) 


Inserting the result of Equation (5.32c) into (5.30) leads to 


dv; 

d t 



G/" '(v,) + I 


j= 1 


W ij V j 


for i = 1, 2, 


(5.33) 


Equation (5.33) is a specific form of the general vector field expression (5.31) for 
which we have been looking. Vector field components ^ are equal to the right 
side of (5.33). To trace the movement of the system within the output space, the 
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components of tf/ must be computed for the entire cube v e [0, 1] as follows: 


A(v,- ~ v-) 


« h(y) = 


- Gjf ~ 1 (v,) + 


C, 



for i = 1, 2, ...» n (5.34) 


It is interesting to observe that the trajectories depend on the seleeted Å values, 
if they are finite, and also on the values of capacitances C,. In addition to these 
factors, the steady State, or equilibrium solution may be reached along one of 
the many trajectories dependent on the initial conditions. The example below 
will provide more insight into the transient performance of a network with finite 
A i values. 


EXAMPLE 5.3 

In this example we will evaluate stationary points and discuss the time- 
domain behavior of the continuous-time example two-bit A/D neural con- 
verter designed in Example 5.2. In particular, we will trace the behavior 
of the converter output versus time and inspect the energy minimization 
property during the evolution of the system. 

Substituting the weights and current values as calculated in (5.27) we 
can express the truncated energy function given by (5.25b) for very high 
gains A; as follows: 

E=~\ 1 2 (-2 )v,v,. - (x ~ I) v 0 - (2x - 2)v, (5.35a) 

2 i=0 7=0 V 2/ 

i*j 

Expanding the double sum and rearranging terms leads to the value for E: 

E = 2v 0 V! 4- — + 2v 1 — x(v 0 + 2vj) (5.35b) 

Figure 1.10 displays two cases of the truncated energy function of Equation 
(5.35) for analog input x = 0 and x = 1. As shown in the figure, the only 
stationary points are appropriate minima, v 0 = = 0 for x = 0, and v 0 = 1 

and Vj = 0 for x = 1. Figure 5.7(a) shows the energy function for x = 0.5 
and it can be seen that, in contrast to previous cases, the minimum of the 
energy function exists for 0 < v 0 < l,Vj = 0. It is thus spread over the 
entire range of output variable v 0 . Indeed, an absence of a definite response 
by the network is caused by the faet that no digital conversion of the analog 
input x = 0.5 exists that would minimize the conversion error (5.24a). 

Another interesting case of energy function behavior occurs for 
1 < x < 2. The energy funetions (5.35) are illustrated for x =, 1.25 and 
x ~ 1.5 on Figures 5.7(b) and (c), respectively. Inspection of both funetions 
reveals that they both have (1) a desirable minimum at v 0 = 1 and v x = 0, 
(2) an undesirable minimum at v 0 = 0 and v 1 = 1, and (3) a saddle point 
somewhere inside the unity square. 



Figure 5.7 Stationary points of the truncated energy function for the two-bit A/D converter of 
Example 5.3: (a) energy function (x = 0.5), (b) energy function (x = 1.25), (c) energy function 
(x = 1.5), and (d) saddle point locus (0 < x < 2.5). 
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The discussion below focuses on more detailed analysis of the energy 
function so that the equilibrium points of the system can be investigated 
based on the evaluation of the energy function. This should provide us with 
much insight into the behavior of the system, but even more importantly, 
will perhaps eliminate the need for solution of differential equations (5.11) 
and replace it with inspection of the energy function minima. 

Let us evaluate the truncated energy function (5.35b). The energy gra- 
dient V£(v) can be computed from (5.35b) and equated to zero for the fixed 
value of x to determine possible high-gain minima at v*: 


V£(v) = 


2v > + |“* 

2v 0 + 2 — 2x 


(5.36a) 


To find the type of stationary points of the network, it is helpful to see that 
the Hessian matrix H of the energy function (5.35b) becomes 


H = V 2 £(v) = 


'0 2 ' 
2 0 


(5.36b) 


Result (5.36b) indicates that the Hessian matrix is not positive definite. The 
energy function has therefore no unconstrained minima. Stationary points v* 
may still be maxima and /or saddles. We can also see that Hessian matrix 
determinants are 


det// n = 0 


det H 12 = 


'0 2 
2 0 


= -4 


Thus, the Hessian matrix is not negative definite either. Thus, whatever 
solutions v* are identified within the cube (0, 1) such that 


V£(v*) = 0 (5.36c) 

will be the saddle points of the E(v) function. This coincides with our 
conclusions related to Equation (5.22) and (5.23) that the minima of the 
truncated energy function (5.35b) are at cube vertices, and the saddle points 
may be located anywhere inside it. 

The solution for a saddle point of the energy function (5.35b) results 
directly from (5.36a) and (5.36c) as 

v, = \v 0 + 1 (5.37) 

The saddle point locus is shown in Figure 5.7(d). The saddle point exists 
within the unity square 0 < v ( < 1, for i = 0, 1, and only for 1 < x < 2. 
The cases 0 < x < 1 and 2 < x < 3.5 are characterized by a monotonic 
energy function and the absence of a saddle point within the square of 
interest. However, there exists a saddle point outside the square and thus 
outside the domain of definition of vector v. wm 
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Let us see how the study of the energy function and the properties of 
the converter’s vector field can be merged to provide insight into the network 
behavior. Examples of selected vector field diagrams produced from Equation 
(5.34) for the discussed converter are shown in Figure 5.8. The figure illustrates 
four different cases of convergence to the equilibrium point for selected gain 
values of A = 10 and 100. The remaining convergence parameters of interest are 
the input conductances g t of the neurons and the capacitance ratio C x /C 0 . The 
convergence has been evaluated for inputs x = 0 [Figures 5.8(a) and (b)] and 
x = 1.8 [Figures 5.8(c) and (d)]. The energy function contours, or equipotential 
lines, have also been marked as background lines. The energy contour maps 
displayed are for the case of very high-gain neurons and a truncated energy 
expression as in (5.35b) (Zurada, Kang, and Aronhime 1990). 

Unless the terms u 0 G 0 and u x G x are excessive, the system evolves in time 
toward lower values of the truncated energy function as shown. Otherwise, the 
system minimizes the complete energy function as in (5.14), involving therefore 
the third term. In any of the cases, the evolution of the system in time is not 
exactly in the energy function negative gradient direction. It can be seen that for a 
high value of gain A and small input conductances g t of neurons (note that these 
are the conditions which make a network free from parasitic phenomena) the 
equilibrium solution approaches one of the comers shown in Figures 5.8(a) and 
(c). As shown in Figure 5.8(b), for large input conductances g 0 = g x = 10 H -1 , 
the convergence does not end at any of the comers. The truncated energy function 
contours do not exactly describe now the evolution of the network in the case of 
large g^ terms. The stationary point M in Figure 5.8(b) is not in the truncated 
energy function minimum; however, it is a minimum of the complete energy 
function involving its last term. It is also the endpoint of all relevant trajectories 
indicating the dynamics of the network. Point M is the minimum of the complete 
energy function (5.14), which for this case becomes 

E m = E + E 3 (5.38a) 

Energy E is given by (5.35b) and the term E ^ can be obtained from (5.14) as 

£3 = i Gj / /(A,z ) _1 dz (5.38b) 

1=0 J 1/2 

where the inverse of the activation function (5.32a) can be expressed 

This yields the explicit value of the third term of energy as 

£3 = I v [v< In v,- + (1 - v,) ln (1 - v,) + ln 2] (5.38c) 

i‘=0 A i 

It should be noted that the total energy function £tot now has minima within 
the cube. Indeed, when neurons are of finite gains A, then the minimum of the 
energy function must exist within the cube, since no walls, edges, or comers 



Figure 5.8 Selected vector field method solutions for dynamic behavior of two-bit A/D con- 
verter: (a) x = 0, A = 100, Ci /Cq = 1, g = 1, (b) x = 0, A = 10, Ci /Co = 10, g = 10, 
(c) x = 1.8, A = 100, Ci /Cq = 1, g = 0.1, and (d) x = 1.8, A = 10, Ci /C 0 = 10, g = 1. 
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are the domain of vector v. This is in contrast to the case of the truncated 
energy function for which minima have been confined to vertices and edges of 
the cube. We can see that the total energy (5.38c) has no truncation error and 
more realistically characterizes the properties of actual networks for which Å is 
of finite value. The reader may refer to Problem P5.19 for more illustrations 
related to the total energy function. 

Let us see how we can utilize the vector field trajectories to visualize the 
phenomena occurring in the network. By comparing the vector fields of Figures 
5.8(a) and (b), we can see that the trajectories become horizontally skewed for 
large C l /C 0 . This condition promotes movement of the system with slowly vary- 
ing Vj and faster varying v 0 . This asymmetry of motion is due to the faet that 
C x holds the majority of charge and the voltage u x has to change more slowly 
than u 0 in this otherwise symmetric network. 

In addition, the vector field provides valuable insight for the case of an 
energy function saddle, as has been shown in Figures 5.8(c) and (d). The analog 
input which needs to be converted is of value x = 1.8. It can be computed from 
(5.36a) that the saddle point S denoted as v* is at (0.8, 0.65). It can also be 
seen that the trajectories indicate that either of the solutions v 0 = 0 and v x = 1 
or v 0 = 1 and v x = 0 can be reached by the network depending on the ini- 
tial condition selected. Comparison of convergence with zero initial conditions 
at v = [0.5 0.5] indicates that the right solution for converting the analog 

input x = 1.8 is reached in the case of equal capacitances C x and C 0 in Figure 
5.8(a). The case shown in Figure 5.8(d) depicts the incorrect solution due to the 
horizontally biased movement of the system toward the right. 

In conclusion, the vector field method provides detailed insight into the 
transients and stability conditions of the actual network. Although the method 
can be illustrated by trajectories and provides insight only for cases of networks 
with up to three neurons, it can be applied for any value of n. In addition, one 
of the many numerical integration formula can be used for numerical analysis of 
dynamic performance of continuous-time networks (DiZitti et al. 1989). Selected 
exercises at the end of the chapter are devoted to this approach. 


5.5 

RELAXATION MODELING IN 
SINGLE-LAYER FEEDBACK 
NETWORKS 

The discussion in the preceding sections of this chapter has shown that the 
gradient-type single-layer network converges, in time, to one of the minima of 
E(y) located within the hypercube [—1,1]. This assumption is valid for bipolar 
continuous neurons with outputs defining the components of /z-dimensional vector 
v. Whether simulated, or reached in actual gradient-type hardware network, the 
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solution may or may not be satisfactory. The solution may not even be the desired 
one due to the existence of a much better solution. 

Let us devise an efficient numerical approach for finding the solution of 
differential equations (5.12). As stated before, the solution of differential equa- 
tions (5.12) can be replaced by solving the purely algebraic equation (5.20b) with 
the additional nonlinear mapping condition (5.12b). Let us notice that (5.20b) is 
equivalent to n algebraic equations written as 


n 


1 WijVj + = GjUj, 

j= i 


for i = 1,2,...,« 


(5.39a) 


Using (5.12b), the above expression can be rewritten in the form involving the 
nonlinear mapping v = f(u) as follows: 



’ 1 / n V 

II 



for i =1, 2, ..., n 


(5.39b) 


The numerical solution of (5.39b) for v, can be modeled by the fixed-point 
iteration method as follows: 


k+1 _ 


= / 


o, + 


for i = 1,2, 


(5.40a) 


where superscript k denotes the index of numerical recursion, k = 0, 1, 

Equation (5.40a) can be briefly rewritten in the vector form as 


.£+1 


G~‘(Wv* + i) 


(5.40b) 


Equation (5.40b) represents the computational model for recursive numerical 
calculations of vector v* +1 . First, an initial output vector v° is submitted to the 
network to initialize it, then the response v 1 is computed and is used to replace 
the value v° on the right side of (5.40b). The algorithm of Equations (5.40) is 
referred to in literature as the relaxation, or method of successive approximation, 
algorithm . As we can see, the algorithm is static, and it represents the solution 
of a system without its dynamic components. Indeed, it is remarkable that the 
solution (5.40b) does not involve network capacitances. Although the relaxation 
algorithm is computationally simple, it does not provide any insight into the 
network dynamics. In addition, the algorithm shown below is numerically stable 
only under certain conditions. 

The recursive computation of v* +1 using fixed-point iterations (5.40) is 
convergent only under certain conditions, which are discussed in more detail and 
derived in Zurada and Shen (1990). A sufficient condition for convergence of 
the relaxation algorithm is 


A < min ■ 


Si 

Gj_ 

\ 

t” 

w u 

j 


for j = 1, 2 j ± i 


(5.40c) 


It can be seen from (5.40c) that to assure unconditional convergence of the 
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(a) 

Figure 5.9a Relaxation algorithm using fixed-point iteration (5.40) for two-bit A/D converter, 
x = 1.7: (a) A = 3.5, stable numerical solution. 


algorithm, A must be kept below an upper bound value on the right side of 
the inequality. An illustration of the relaxation algorithm for the two-bit A/D 
converter is shown in Figure 5.9 for the input x — 1.7. Figure 5.9(a) shows 
stable recursive calculations using (5.40) for Å = 3.5. The recursive calcula- 
tions become oscillatory, however, for the A value of 5, as shown in Figure 
5.9(b). Note that both cases from Figure 5.9 originate at zero initial conditions 
of v° = [ 0.5 0.5 ] . This corresponds to the network without an initial energy 

and starting at a neutral initial point. Specifically, the network capacitances hold 
no initial charge and provide no initial bias. 

In summary, the vector v(f) converges, in time, to one of the minima of E(\) 
constrained to within the [0, 1] cube, and transients are dependent on the initial 
condition v(0) = f[u(0)]. The system setties there according to its dynamics, and 
the capacitors and other components determine the actual rate of convergence 
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Figure 5.9b Relaxation algorithm using fixed-point iteration (5.40) for two-bit A/D converter, 
x = 1 .7 (continued): (b) A = 5, unstable numerical solution. 


toward the minimum of E(v). The computational relaxation model (5.38), al- 
though entirely based on the resistive portion of the network, captures one of the 
time-domain, or transient, solutions at which the gradient vector vanishes. 

It should also be noted that the relaxation algorithm, or any other static al- 
gorithm that solves the equation (5.20b) based on zeroing of the gradient vector, 
should be used for finding stationary points rather than for tinding minima only. 
Indeed, since VE(v) = 0 both at saddle points and at minima, the numerical 
solutions for cases as shown in Figure 5.9 also produce the saddle point S. 

Similar results can be obtained from analysis of single-layer feedback net- 
works using the Newton-Raphson algorithm to solve (5.20b). The Newton- 
Raphson algorithm is unconditionally stable; however, it is very sensitive to 
the initial conditions. More insight into application of this algorithm for neural 
network numerical solutions is available to the reader in Problem P5.18. 
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5.6 ■■ 

EXAMPLE SOLUTIONS OF 
OPTIMIZATION PROBLEMS 

In this section, two comprehensive examples of continuous-time single-layer 
feedback networks will be covered. The discussion will emphasize understanding 
and practical integration of most of the concepts introduced earlier in the chapter. 
Both examples present the solution of optimization problems modeled with fully 
coupled single-layer network. The first example covers the design of an electronic 
network that performs the conversion of a multidimensional analog signal into 
a four-dimensional binary unipolar vector using the least squared error criterion. 
The presentation of the first example involves all stages of the design, and is 
followed by the detailed performance study of the network. The second example 
approaches a classical and more general optimization problem solution known in 
technical literature as the traveling salesman problem. The presentation of this 
second example focuses on formulating the most suitable objective function that 
would need to be substituted for an energy function. 


Summing Network With Digital Outputs 

The network is based on the concept presented earlier in Example 4.2. The 
adder /converter discussed in this section and the simple two-bit A/D converter 
in Example 4.2 belong to the same class of optimization networks. They both 
attempt to solve specific optimization problems of error minimization between 
analog input values and their digital output representation using Boolean vectors. 
Note that the problem essentially reduces to finding a mapping of various input 
signals into an n-dimensional binary vector such that the mapping objective func- 
tion, defined as quadratic error, is minimized. To design the network, appropriate 
responses need to be encoded as its equilibrium States. 

Assume that the Circuit shown in Figure 5.4 needs to compute the sum 
of N analog voltages x k with corresponding weighting coefficients a k . A digital 
representation of the output sum using n bits is required, thus v, = 0, 1, for i = 0, 
1, ..., n — 1. Let us momentarily assume that the accurate analog sum of the 
signals is x, thus: 

* = S a k x k (5.41) 

k=0 

This value has to be approximated by a binary n-component vector. The compu- 
tational energy that needs to be minimized for this Circuit behaving as an n-bit 
A/D converter can be expressed in terms of the squared error 



E, 


(5.42a) 
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Expansion of the sum indicates that (5.42a) contains square terms (v,2') 2 , thus 
making the w n terms equal to 2 2 ', instead of equal to zero as required. Similarly, 
as in Example 5.2, a supplemental energy term E a (5.42b) should be added that, 
in addition to eliminating diagonal entries of the weight matrix W, will have 
minima at v, equal to 0 or 1. 

E a = E^tvr 1) (5.42b) 

2 i=0 

Let us sum E c and E a and use the truncated energy function (5.14) which is 
allowed in case of high-gain neurons. We can now require equality between the 
specific energy function value on the left side of Equation (5.43) and its general 
form shown on the right side of (5.43): 

1 / n- 1 A 2 1 n- 1 1 

x (x ~ I v,-2‘ -ri 2 2 'v,(v i - 1) = - -v'Wv - i'v (5.43) 

2 \ i=0 / 2 i~ o 2 

Comparing coefficients on both sides of (5.43) yields the conductance matrix and 
bias current vector entries of the designed neural network as follows 

w i} = —2 i+J (5.44a) 

i. = -2 2 '- 1 + 2‘x (5.44b) 

In the case of a four-bit A/D converter, the results of Equations (5.44) can be 
written in matrix form as follows: 


■ 0 

2 

4 

8' 


i i 

2~ X 

2-2x 

2 

0 

8 

16 

i — — 

4 

8 

0 

32 

9 * 

8 -4x 

8 

16 

32 

0. 





.32 — 8jc_ 


(5.44c) 


Due to the simplification introduced in (5.41), these are only preliminary an- 
swers. To complete the design and to obtain the full diagram of the summing 
and converting circuit, the value of x as in (5.41) should be plugged into the left 
side of formula (5.43). This yields the modified energy function in the following 
form: 


E 


1 (N - 1 n - 1 A 2 1 n-\ _ 

r ( I a k x k - I v,2' "ri 2 2, v I (v, - 1) 
Z \k=0 i= 0 J 2 1=0 


(5.45a) 


Rearranging the above expression results in the energy function for the summing 
circuit being equal to 


1 n~\ n - 1 

I "2 ,+ Vy 

2 i = o ;=o i=o 

i±j 


n-\ ( , .N~ 1 

I ( — 2 2/_1 + 2 l I a k x k )v i 


+ 




k = 0 
2 


(5.45b) 
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Comparing Equation (5.45b) with the right side of Equation (5.43) allows the 
calculation of the conductance matrix elements w t j and the neuron input currents 
i t of network S as 

w tJ = -2 i+i (5.46a) 

■ = +2‘ N i'a k x k (5.46b) 

k=Q 

Figure 5.10 shows the resulting network with appropriate conductance values 
labeled. The network consists of two parts. The bottom part replicates the fully 
coupled single-layer network of Figure 5.4a. This is a “neural” part of the net- 
work with synaptic weights as in (5.46a). Note that the originally negative signs 
of the conductance values have been absorbed by the feedback signal derived 
from inverted neuron outputs. The top part of the network denoted S produces 
appropriate bias currents by computing the expression (5.41) as indicated by the 
sum on the right side of (5.46b). Conductances connect voltage sources Xj and 
a voltage source —IV with inputs of neurons to produce current according to 
formula (5.46b). 

The approach presented in the preceding paragraphs can be extended to 
the case of binary addition. For binary coded values of signals x and y, the 
corresponding energy function similar to that obtained in (5.45a) becomes 

1 //i-l n - 1 n \ ^ 1 n 

E = t I 2 k x k + I 2\ - I v,2‘ - - S 2 2, v,(v i - 1) (5.47) 

^ U=0 k = 0 /= 0 J l i=o 

Rearrangement of (5.47) leads to an expression for the values w t j identical to 
(5.46a). The new conductance values of network S performing the binary addition 
are now the only difference between the originally considered converting Circuit 
and the joint summing and converting circuit. The input currents to the neurons 
are specified by the following expression: 

i, = — 2 2 ' -1 + 2‘ ("£ 2 k x k + "i 2 O , for / = 0, l, n (5.48) 

\k = 0 fc=0 / 

The binary adder network S generating the bias current vector i consists now 
of the total of (2 N + 1) (n + 1) conductances. The “neural” part of the adder 
remains unchanged and identical as in previous cases of an A/D converter and 
an analog signal adder. 

To illustrate the theoretical considerations involved in this design, the per- 
formance of an electronic network model has been simulated. The convergence 
of transients for the hardware model has been evaluated for nine different initial 
conditions. The neurons have been designed using NMOS transistors (Zurada and 
Kang 1989). For the simple two-dimensional case, contour maps of truncated 
energy functions, as in (5.35b), have been computed and convergence versus 
time evaluated for x = 0, 1, 2, and 3. The resulting trajectories obtained from 
an electronic circuit simulator program SPICE2G1.6 are shown in Figure 5.11. 
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(a) 


(b) 


Figure 5.1 la, b Equipotential lines and simulated transients for two-bit adder for different inputs 
(• denotes initial point, □ denotes zero initial conditions point): (a) x = 0, (b) x = 1 . 


The distance between each of the two neighboring energy contour lines shown 
is equal to 1/16 of the energy difference within the cube. As expected, none of 
the energy functions has local minima within the cube. Transitions in time take 
place toward one of the constrained minima and they are generally not in the 
negative gradient direction. 

In the cases of x = 1.3 and 1.8, the corresponding energy functions have 
a saddle point and two minima at 01 and 10, respectively. The convergence of 
transients is to the correct minimum in both cases for C 0 = C x if zero initial 
conditions are chosen. For other choices of initial conditions, the transients may 
decay while yielding an incorrect final response at the output of this neural-type 
A/D converter. The erroneous response case is depicted in Figures 5.11(e) and 
(f), which show three trajectories ending up at the incorrect minimum. 

For small slopes of the energy surface, the convergence is expected to be 
slow. This has actually been observed in Figures 5.11(b) and (c) for curves 
leaving the initial points most distant from the correct solutions, which are 01 
and 10, respectively. The trajectories indicate rather slowly varying transients 
near v x = 0.25 [Figure 5.11(b)] and v 2 = 0.75 [Figure 5. 11 (c)], respectively. 
Slight discontinuities on all curves of Figure 5.11 are also noticeable. They are 
due to the shape of the voltage transfer characteristics near u — 0 for the actual 
electronic neuron models. 

The results of transient simulations at the Circuit level using the SPICE2G1.6 
program for a four-bit summing network with zero initial conditions have been 
correct for 145 of 155 simulated cases. Input x applied to a four-bit summing 
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£„*.=-0.8 (■) 


£ mta =-1.6(H) 


(e) 


(f) 


Figure 5.11c-f Equipotential lines and simulated transients for two-bit adder for different inputs 
(• denotes initial point, □ denotes zero initial conditions point) (continued): (c) x = 2, (d) x = 3, 
(e) x = 1.3, and (f) x = 1.8. 


network is varied in the range between O and 15.5 with a step of 0.1. In all 10 
cases of erroneous convergence, however, the settled binary output is adjacent 
to the correct answer, thus making the resulting response error bounded. Results 
of the transient Circuit analysis and subsequent thresholding of v are shown in 
Table 5.1. 
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TABLE 5.1 

Simulation results for input x from the interval 
[0, 15.5], four-bit adder. 


Input 

Integer Part 





Input Decimal Part 





0 

0.1 

0.2 

0.3 

0.4 

0.5 

0.6 

0.7 

0.8 

0.9 

0 

0 

0 

0 

0 

0 

0.5 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

2 

2 

2 

2 

2.5 

3 

3 

3 

3 

3 

3 

3 

3 

3 

3 

3 

3 

3 

3 

4 

4 

4 

4 

4 

4 

4 

4.5 

5 

5 

5 

5 

5 

5 

5 

5 

5 

5 

6 

6 

6 

6 

6 

6 

6 

6 

6 

6 

6 

6.5 

7 

7 

7 

7 

7 

7 

7 

7 

7 

7 

8 

8 

8 

8 

8 

8 

8 

8 

8 

8 

8 

8.5 

9 

9 

9 

9 

9 

9 

9 

9 

9 

9 

9 

9 

10 

10 

10 

10 

10 

10 

10 

10 

10 

10.5 

11 

11 

11 

11 

11 

11 

11 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12 

12.5 

13 

13 

13 

13 

13 

13 

13 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14 

14.5 

15 

15 

15 

15 

15 

15 

15 

15 

15 

15 







Interesting results have been observed for simulation of the case when the 
integer answer for the sum, or A/D conversion problem solution, does not exist 
because it is exactly equidistant from the adjacent integers. The v 0 bit has re- 
mained undetermined and has settled very close to 0.5 for 8 of 15 such cases. 
Evaluation of the energy value (5.14) shows flat minima spread between adja- 
cent integers. The case is reminiscent to the one depicted in Figure 5.7(a) for 
the truncated energy function of the two-neuron network. 

Sample actual transients illustrating such cases are shown in Figure 5.12, 
which shows for case x = 0.5 that the truncated energy function is monotonic, 
it has no saddle point within the cube, and the least significant output bit of the 
network converges to near 0.5. On the other side if x = 1.5, the convergence 
is somewhat erroneous as shown in Figure 5.12(b), because the saddle point at 
v 0 = Vj = 0.5 divides symmetrically the two monotonic subsurfaces of the error 
surface E. This case was depicted earlier in Figure 5.7(c) showing the truncated 
energy surface for x = 1.5. 

Inspection of corresponding energy surfaces has also shown that for the 
erroneous results listed in Table 5.1 for the four-dimensional case, the network 
outputs have converged to erroneous shallow minima. It has also been observed 
that the convergence to correct minima is usually more reliable when initiated 
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(a) 


(b) 


Figure 5.12 Equipotential lines and simulated transients for two-bit adder for different inputs 
(• denotes initial points, Qdenotes zero initial conditions point): (a) x = 0.5 and (b) x = 1.5. 


from a higher initial energy level. It has also been found that the network un- 
mistakably rejected all obviously incorrect suggestions. Additions with negative 
results have all been rounded to the lowest sum 00, . . . , 0. Additions with 
overflow sums have been rounded by the network to the highest sum 11, . . . , 1. 


Minimization of the Traveling Salesman 
Tour Length 

The traveling salesman problem of minimization of the tour length through 
a number of cities with only one visit in each city is one of the classic opti- 
mization problems. The objective of the problem is to find a closed tour through 
n cities such that the tour length is minimized. The problem is NP-complete 
(nondeterministic polynomial time). 

Typically, an optimization problem of size n may have many possible so- 
lutions. However, only one of the solutions minimizes the cost, or error, func- 
tion. Optimization problems of this kind are called combinational optimization 
problems . They are often divided into classes according to the time needed to 
solve them. 

If an algorithm exists that is able to solve the problem in a time that in- 
creases polynomially with the size n of the problem, the problem is said to 
be polynomial ( P ). NP-complete problems are a class within the P class. Such 
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Figure 5.13 Traveling salesman problem example solution: (a) shortest path and (b) city/ 
position matrix. 


problems can be tested in polynomial time whether or not a guessed solution 
of the problem is correct. To solve an NP problem in practice, it usually takes 
time of order exp(n). Thus, the traveling salesman problem solution using the 
conventional combinational optimization approach grows exponentially with the 
number of cities n. We consider a solution for the problem using continuous-time 
gradient-type neural networks (Hopfield and Tank 1985). 

An example of the cities and the proposed tour is shown in Figure 5.13(a) 
for n = 5. The distances between cities are assumed to be constant for a given 
set of visited cities. The reader can verify that there are ni tours that can be 
produced in an n city problem. Among them, 2 n closed paths are originating at 
different n cities and going in both directions; therefore, they are of identical 
lengths. Thus, there are (n— 1)! / 2 distinet paths that have to be evaluated as far 
as total tour length minimization is concemed. 

A network consisting of n 2 unipolar continuous neurons can be designed to 
solve this problem. Let us arrange the neurons in an n X n array. The suggested 
representation for the n X n neuron array can be arranged in the form of a matrix, 
where the fth row in the n X n matrix corresponds to a city y y -, and the fth 
column in the matrix corresponds to a city position We thus have city rows 
and position columns. One example solution of the problem is shown in Figure 
5.13(b) for n = 5. It has the form of a city /position matrix with 0, 1 entries 
representing the best tour shown in Figure 5.13(a). Since each city can only be 
visited once and no simultaneous visits are allowed, the solution matrix can have 
only a single 1 in each column and a single 1 in each row. The neuron tumed 
on, or with output 1, in the square array of neurons indicates a particular position 
of a particular city. 
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Note that the example of Figure 5.13 introduced specific notation. The row 
subscript corresponds to a city name, and the column subscript indicates a po- 
sition in the tour. Using the correspondence between the neuron’s array and the 
solution matrix, we may say that a valid solution of the problem would have 
only a single neuron in each column and only a single neuron in each row on. 
Thus, a total of n of n 2 neurons is distinctly tumed on. Accordingly, we will 
assume high-gain neurons for the network modeling this problem. Note that the 
truncated energy function as in (5.14), with the third term omitted for simplicity, 
can be rewritten as follows: 

E = “ ~ XX w xi,Yj v xi v Yj ~ X ixi v xi (5-^9) 

l Xi Yj Xi 

Let us observe that the domain of vectors v where the function E is minimized 
is the 2 nXn comers of the n 2 -dimensional hypercube [0, 1], 

Let us attempt to find the suitable expression for the energy function solving 
this problem (Hopfield and Tank 1985). Our objective is to find weights and cur- 
rents that are coefficients of the quadratic form of the energy function (5.49) such 
that the objective function is minimized. The goal of the optimization is to select 
the shortest tours. Also, the matrix of responses has to be penalized for more 
than a single “on” neuron per row and per column. In addition, the matrix has to 
be penalized for trivial solutions of all entries being zero. Let us quantitatively 
express each of the optimization conditions just spelled out qualitatively by using 
four following terms of the energy function, each fulfilling a separate role: 


E\ ~ ^ X X X v xi v xj> 

X i j 

for i + j 

(5.50a) 

E 2 = # XXX 

for X # Y 

(5.50b) 

i X Y 
/ 

\2 


£ 

O 

II 

- ^ 


(5.50c) 

^XXX^xr v x/ ( v y,/+ 1 + v yj- i) * X ^ Y 

(5.50d) 


Let us discuss the term E u as in (5.50a), of the energy function. A matrix 
with 0 and 1 entries has no more than one 1 in each row if and only if all possible 
column-by-column dot products within the matrix are zero. The double intemal 
sum of (5.50a) expresses the N — 1 dot products of a column having position X 
times the remaining columns. Performing the leftmost summation operation (in 
5.50a) yields the required sum of all column dot products, which ideally should 
all be zero. The zero value of £j would guarantee that each city will be visited 
only once. Note that the penalty will progress and become larger if more than a 
single 1 is contained in one or more rows. 

The energy term E 2 proposed as in (5.50b) can be substantiated in a similar 
way, and it contains a progressive penalty for two or more ones in each column 
of the matrix. This penalty takes a form of positive value for this term of the 
energy function and is for simultaneous visits in more than a single city. The 
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justification for this term is similar to that above for E u however, it applies to 
controlling the number of ones within columns of the city /position matrix. 

Note that penalties E x and E 2 are also of zero value for trivial solutions 
containing no ones, or less ones than required within the city /position matrix. 
Therefore the term £ 3 , as in (5.50c), is required to ensure that the matrix does 
not simply contain all zeroes. In addition, this term penalizes the objective func- 
tion for having more or less than n ones in the matrix. The energy function 
Ei + E 2 + £ 3 has minima of value 0 for all matrices that have exactly one 1 
per row per column, and all other responses of neurons produce higher energy 
values. Minimization of the energy function E x + E 2 + £ 3 ensures, or better, 
favors, that one of the valid tours can be generated. 

The discussion of the energy function has focused on formulating constraints 
and has not yet involved the true goal of tour optimization. The term £ 4 , as in 
(4.50d), needs to be added and minimized for the shortest tours. Simply summing 
the distance of the tour is a natural choice for £ 4 , since minimizing the tour length 
is the goal. However, denoting the intercity distances as d XY , we want to make 
sure that only the distances between adjacent cities are counted. To include end 
effects like the adjacency of city n— 1 and 1, the subscripts for summations have 
to be understood as summed modulo n. This term is numerically equal to the 
length of the path of the tour. 

The resulting weight matrix and bias currents can now be obtained by equat- 
ing the sum of energies E x through E 4 specified by (5.50) with the total energy 
value as in (5.49). The weights computed in this problem are (Hopfield and 
Tank 1985): 

Wxijj = -2AS xr (l - fy) - 2Bfy(l - S xr ) 

— 2C — 2Ddxy(8jj +l + fy,_i) 1 ' 3) 

where fy is the Kronecker delta function defined as fy = 1, for / = j, and fy = 
0, for i # j. Positive constants A, B, C, and D are selected heuristically to build 
an appropriate sum of terms E x through E 4 . The constants are responsible for 
weighting the relative validity of each of the four penalty terms in the overall 
energy function £ to be minimized. The extemal bias currents are 

i Xi = 2 Cn (5.51b) 

Observe that the four separate terms of the weight matrix (5.51a) have been gen- 
erated by the four parts of the energy function as in Equations (5.50a) through 
(5.50d), respectively. The term E x leads to inhibitory (negative) connection of 
value —A within each row. The term £ 2 causes identical inhibitory connections 
within each column. The term £ 3 results in global inhibition provided by each 
weight, and the term £ 4 contains the city distances data weight contribution. 
Finally, the extemal input currents are set at an excitation bias level as in (5.5 lb). 

The problem formulated as shown above has been solved numerically for the 
continuous activation function defined in (2.4a) with A = 50, A = B - D = 250, 
and C = 100, for 10 s « < 30 (Hopfield and Tank 1985). The normalized city 
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maps were randomly generated within the unity square to initialize each com- 
putational experiment using a simulated neural network. The equations solved 
numerically were 


du 


% _ _u Xi 


dt 


2A X v X j 2 B X v Yi 

T j±i Y=tX 


2C I nj 2DX^xy( v r,/+i + v y,/-i) 


(5.52) 


where the time constant t was selected as 1. For n = 10, 50% of the trials pro- 
duced the two best paths among 181440 distinet paths. For n = 30, there exists 
about 4.4 X 1 0 30 possible paths. The simulations of the solutions for this size of 
problem have routinely produced paths shorter than 1. Statistical evaluation of 
all paths shows that there are only 10 8 paths shorter than 7. Although no optimal 
solution has been reported when using this approach, the bad path rejection ratio 
by the network has been impressive. In numerical experiments by Hopfield and 
Tank (1985), the bad path rejection ratio was between 10“ 23 and 10 -22 . 

There are practical limitations of the discussed approach. The method be- 
comes more difficult to apply for larger problems, typically larger than 10 cities. 
However, the example of the traveling salesman problem illustrates that a class 
of optimization tasks can be approached using the dynamical system model pre- 
sented in this chapter. Many other possible applications of the model have been 
reported in the literature. Typical problems solved have been of the resource 
allocation type, which are subject to different constraints. Job shop scheduling op- 
timization has been reported by Foo and Takefuji (1988). In another application, 
the concentrator-to-sites assignment problem was mapped onto the single-layer 
network model by associating each neuron with the hypothesis that a given site 
should be assigned a particular concentrator (Tagliarini and Page 1988). 

Fully coupled single-layer networks can be applied to a variety of other 
technical problems. Networks of this class are capable of handling optimal rout- 
ing of calls through a three-stage interconnection network (Melsa, Kenney, and 
Rohrs 1990). The solution uses a discrete neuron network that seeks a mini- 
mum of an energy funetion for cases in which an open path exists through the 
interconnection network. When no such path is available, the energy funetion 
minimization performed by the network terminates at a nuli State indicating that 
the interconnection network is blocked. 

Another successful application of energy minimizing networks is for routing 
communication traffic within a multinode network consisting of many nodes and 
links. The solution of such a problem requires finding the best of multilink paths 
for node-to-node traffic to minimize loss. The loss is represented by expected 
delay or some other traffic inconvenience. The minimization procedure has been 
implemented using a modification of the traveling salesman problem solution 
(Rauch and Winarske 1988). Reasonable convergence for a 16-node network has 
been reported for up to four links from origin to destination. 
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Other applications of this class of neural networks involve microelectronic 
Circuit module placement on the chip that minimizes the total interconnecting 
wire length (Sriram and Kang 1990). The two-dimensional problem has been 
decomposed into two coupled one-dimensional placement problems. The perfor- 
mance of the designed neural network on problems involving placement with up 
to 64 modules has been very encouraging, with the network being able to find 
the globally optimal or near-optimal solutions in many cases. 

Single-layer networks have also been successfully applied to general lin- 
ear and nonlinear programming tasks. Nonlinear programming is a basic tool 
in systems where a set of design parameters is optimized subject to inequality 
constraints. Both experimental and theoretical results have been presented for 
completely stable solutions of specific linear and nonlinear programming prob- 
lems (Kennedy and Chua 1988; Maa and Shanblatt 1989). An economic power 
dispatch problem has also been solved by using the linear programming network 
with linear equality and inequality constraints (Maa, Chin, and Shanblatt 1990). 
A linear programming problem formulation and solution is also demonstrated in 
Section 8.1. 

5.7 

CONCLUDING REMARKS 

In this chapter we have introduced the basic properties of single-layer 
feedback neural networks. Neurocomputing algorithms producing convergence 
toward solutions for both discrete-time and* continuous-time networks have been 
reviewed. The discussion emphasized the dynamical principles of single-layer 
continuous-time feedback networks. Also, several network design examples have 
been presented to illustrate the basic concepts of system evolution toward its 
attractors. Throughout the chapter, the concept of energy function has been 
stressed. This had two goals in mind: to demonstrate the inherent stability of the 
networks and the suitability of the networks for producing solutions of certain 
optimization tasks. 

In addition to many questions that have been answered about the perfor- 
mance of this neural network model, some problems still remain open. Let us 
raise some of the remaining open questions. The stationary solution reached in 
the time domain does not represent, in general, a global optimum solution to 
the problem of energy function minimization. This is due to the often highly 
complex shape of the multidimensional energy function E(\). This limitation is 
somewhat mitigated by the faet that global solutions of real large-scale mini- 
mization problems of E(v) are often mathematically very hard to track anyway, 
and for large-scale problems cannot be found with certainty. 

One of the difficult tasks faced by the designer at the beginning of the design 
process is the translation of the optimization problem into the energy function 
minimization. Here, the general energy function must be matched by the criterion 
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function which is specific for the problem being solved. Given the restrictions 
on the connectivity matrix imposed by the model, these functions may not match 
easily. Therefore, problem-specific energy functions are usually hard to find. We 
have been able to discuss several choices of E (\ ), but the general solution to the 
problem does not exist. The only energy form known is the general one stated as 
in (5.5) or (5.14). Following simplifying modification can often be applied here: 
If high-gain neurons are used, the third term of the energy expression can be 
skipped. Also, when no extemal bias input i is needed, the second energy term 
vanishes and network weights only need to be computed. 

It often takes much of the network designer’ s ingenuity to devise a mean- 
ingful energy function in the neurons output space v and to find its coefficients 
w t j and In faet, our ignorance conceming possibly existing energy functions 
that solve a number of optimization problems creates one of the bottlenecks that 
makes the use of the gradient-type networks not as simple. For certain classes 
of systems, such as associative memories, however, E(\) and w t j are easy to find 
in an explicit form. This will be discussed in the next chapter. 

As we have seen, the convergence toward an attractor being a solution for 
single-layer feedback neural networks corresponds to the transient behavior of 
a nonlinear dynamical system. Although the concept of a time constant is not 
direetly applicable to nonlinear networks, the transient analysis shows that the 
convergence of a neural network typically requires several time constants as 
defined for the network’s linear portion. This property is often reported based on 
numerous experiments both involving actual hardware and numerical simulations. 

Figure 5.14 illustrates the procedure for solving optimization problems us- 
ing single-layer feedback networks. Basic steps that need to be followed by the 
designer are summarized on the diagram. The problem modeling procedure ter- 
minates by implementing transients within the network and accepting the solution 
produced. In easy optimization tasks, each of the generated solutions is usually 
correct. This has been the case for A/D conversion networks for low n values. 
For harder optimization problems such as the traveling salesman problem for 
large n (n> 30), the produced solutions are often good but not strietly optimal. 
This is due to the presence of many local minima in /i-dimensional space which 
are responsible for trapping the evolution of the transients. It should be realized 
that this class of networks for large n values yields solutions which are not 
necessarily optimal but acceptable in a statistical sense, i.e., over a population 
of experiments rather than for a single transient event. 

An additional limitation existing within the networks discussed concems 
their capacity. There is a limit on the total number of correct solutions that can 
be stored and expected from the network as its output. Although this number 
cannot be set precisely, heuristic evaluations of gradient-type networks show that 
the number of encoded solution vectors that can be retrieved satisfactorily is only 
a small fraction of the total number of neurons n. The network capacity issues 
and others related to network performance will be discussed in more detail in 
the following chapter. 
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Figure 5.14 Flowchart of producing solutions of optimization problems using fully coupled feed- 
back networks. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P5.1 The weight matrix W for a network with bipolar discrete binary neurons 
is given as 

'0 1-1-1 -3' 

10 11-1 
w= -i i o 3 i n _1 
-11301 
-3 -1110 

Knowing that the thresholds and extemal inputs of neurons are zero, 
compute the values of energy for v = [ — 1 1 1 1 1 ] ? and v = 

[-1 -1 1 -1 - 1 ]'. ’ 
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P5.2 Figure P5.2 shows a discrete-time recurrent network with high-gain bipolar 
neurons. 

(a) Find the weight matrix of the network by inspecting the connections. 

(b) Analyze asynchronous updates from the following initial States: 



Assume that the updates take place in natural order starting with the 
first neuron, i.e., 1, 2, 

(c) Identify any stable equilibrium state(s) of the network based on part 

G>). 

(d) Compute the energy value at the stable equilibrium state(s) evaluated 
in part (c). (Use the truncated energy function.) 

P5.3* Assuming the weight matrix and other input conditions from Problem 
P5.1, compute the energy values for all 32 bipolar binary vectors (there 
are five energy levels here). Identify the potential attractors that may 
have been encoded in the system described by the specified matrix W 
by comparing the energy values at each of the [— 1, +1] cube vertices. 
Implement five sample asynchronous discrete-time transitions from high- 
to low-energy vertices. 

P5.4 The weight matrix W for a single-layer feedback network with three 
neurons is given as 

0 -I -f 

w= -i o i rr 1 

-i i o 

Calculate the gradient vector, V£(v), for the energy function in three- 
dimensional output space and its Hessian matrix, V 2 £'(v). Prove that the 
Hessian matrix is not positive definite (see the Appendix). 

P5.5 For the two-neuron continuous-time single-layer feedback network shown 
in Figure P5.5 with high-gain neurons, find the following: 

(a) State equations 

(b) the weight (conductance) matrix, W 

(c) the truncated energy function, E(\) 

(d) the gradient vector of the truncated energy function, V£(v). 
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Figure P5.2 Discrete-time recurrent network for analysis. 
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Figure P5.5 Electronic neural network for Problem P5.5. 


P5.6 For the continuous-time single-layer feedback network shown in Figure 
P5.6 with high-gain neurons, find the following: 

(a) State equations 

(b) the weight (conductance) matrix, W 

(c) the current vector, i 

(d) the truncated energy function, E(\) 

(e) the gradient vector of the truncated energy function, VE(y). 

P5.7 Find the weight matrix W for the four high-gain neuron network shown in 
Figure P5.7. The network uses inverting neurons and four positive-valued 
conductances of value 2 II ~ 1 each. Then find the energy values for outputs 
V], v 2 , and v 3 given as below: 


v, =[-1-1 1 1]', v 2 = [-1 -1 -1 1]', v 3 = [-1 -1 -1 -1]'. 


P5.8 Sketch the appropriate conductances connecting outputs of inverting and 
noninverting neurons in a network having the overall diagram as shown 
in Figure P5.7 knowing that 


W = 


0 

-1 

-1 


-1 

0 

1 


-1 

1 

o 


ir 


i = 0 


Evaluate the truncated energy function produced by the network for v ( = 
±1, for i = 1, 2, 3, and identify possible attractors of the discrete-time 
network with high-gain neurons. 
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Figure P5.6 Electronic neural network for Problem P5.6. 


P5.9 The truncated energy function, £(v), of a certain two-neuron network is 
specified as 

£(V) = ~\ ( V 1 + 2 V 1 V 2 + 4v 2 + V l) 

Assuming high-gain neurons: 

(a) Find the weight matrix W and the bias current vector i. 

(b) Determine whether single-layer feedback neural network postulates 
(symmetry and lack of self-feedback) are fulfilled for W and i com- 
puted in part (a). 

P5.10 Assuming the energy functions and other conditions as in Problem P5.9, 
find 

(a) the gradient vector of the energy function, V£(v) 

(b) the Hessian matrix of the energy function, V 2 £(v) 

(c) any unconstrained minima or maxima v* the energy function may 
have. 

P5.ll The truncated energy function of a certain three-neuron single-layer net- 
work is known as 

E(y) = v\ - v\ + V 3 - 2v 1 v 3 - v 2 v 3 + Av x + 12 

Find the following: 

(a) the gradient vector, V£(v) 

(b) the Hessian matrix, V 2 £(v) 
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Figure P5.7 NetWork for Problem P5.7. 


(c) any minima, maxima, or saddle points the energy function may have. 

What are they? 

P5.12 The high-gain neuron continuous-time network is shown in Figure P5.12. 
By evaluating its energy function, find analytically any minima, maxima, 
or saddle points the function has within the [—1,1] three-dimensional 
cube. Use the truncated energy expression containing no third term. Then 
compute numerical energy values at each of the cube’s eight vertices. 

P5.13 The energy function E(v) for the two-bit A/D converter discussed in 
Example 5.3 is specified by Equation (5.35b). It has been derived under 
the assumption of infinite gain neurons. Calculate the accurate energy 
function at v = [0.01 0.01 ] ' using the accurate expression for energy 

(5.14) which, in addition to (5.35b), involves the additional integral term 
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(Weight values in mhos) 

Figure P5.12 High-gain neuron network for Problem P5.12. 


(5.38c). This term is equal to 

i p 

I G, / fr\ z ) dz 

/=0 J 0.5 

Assume the converter as in Figure 5.6(b) with g 0 = 8i ~ 1 fi 1 , x = 1.7, 
and 

v,* = /fa) = — \ ■ 

1 + exp(—10« i ) 

P5.14 Calculate the truncated energy function £(v 0 ,v 1 ,v 2 ) for a three-bit A/D 
converter that uses high-gain neurons. Then calculate the point v* for 
which V£(v*) = 0. Based on the evaluation of the Hessian matrix, 
V 2 £(v*), determine what is the type of point v* computed. 

P5.15 A single-layer feedback neural network is described by the nonlinear dif- 
ferential equations, which must be solved numerically 

dw 1 ( n \ 

( I WjjVj - ufli + iA , for 1 = 1,2,...,/! 

and 



v,- = f(Mi) 
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(a) Derive the forward Euler numerical integration formula to find for 
i = 1, 2, ..., n. 

(b) Derive the backward Euler numerical integration formula to find u h 
for i = 1 , 2 ,...,«. 


P5.16* Implement the backward Euler numerical integration formula derived in 
Problem P5.15 to solve for v(f). Obtain the solution for a simple two-bit 
A/D converter as in Figure 5.6. Assume C 0 = C j = 1F, g 0 = g l = 
1 ft -1 , jc = 0, and 


V/ =/(«,) = T— 7—. A , , 

1 + exp(— 10«,) 


for i = 0, 1 


Test the program for the following initial conditions: v 0 = [0.25 0.25 ]\ 
v 0 = [0.5 0.5]', and v 0 = [0.75 0.75]'. 

P5.17 Rearrange the nonlinear differential equations (5.28) describing the sim- 
ple two-bit A/D converter to the form (5.34) so that the vector field 
components t fi 0 , and ifi 

^ = «Ao( v o.Vi) 
dvi 


can be explicitly computed for known values of x h g,, C„ and A, (i = 
0,1). Find </b( v 0’ v i) an d ( v o> v i ) f° r the activation function v, = [1 + 
exp(— A«,)] -1 . 

P5.18 The stationary point solution of Equations (5.28) describing the simple 
two-bit A/D converter can be obtained using the Newton-Raphson method 
by solving the nonlinear algebraic equation F(u) = 0, where 


F(u) = 


-2vj + x - 0.5 - (g 0 - 2 )h 0 
-2v 0 + 2x - 2 - (g l - 2)h, 


The iterative solution is obtained in this method as shown: 

F 0 ( u*)‘ 

Fi( u*)_ 

where the Jacobian matrix is defined 


t k + 1 


,k + 1 


L"i 


i u n 


[J(U)] 


-1 


<~o 1 

i 

dF o' 

du 0 

du { 

9F , 

dF i 

1 <=> 
S 

1 

1 

du x 


Find the closed form for the iterative solution of the equation F(u) = 0, 
including the Jacobian matrix entries (no matrix inversion required). The 
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A/D 2- bit Converter 

Input x=1.6, Lambda = 2 
Neuron Input Conductance = 2.5 


- 0.8550 



0.02 0.03 0.04 0.05 0.06 



v 0 


(a) (b) 

Figure P5.19 Total energy function for the two-bit A/D converter as in Problem P5.19 for 
x = 1.6, A = 2, and gb = Øi = 2.5 O -1 (a) energy surface (b) energy contour map. 


activation function to be assumed is 

v, = [l + exp(— åm,)] 1 , for i = O, 1 

P5.19 Figure P5.19 illustrates the energy surface and the energy contour map 
near the upper left comer of the unity square. The graphs shown are made 
for the two-bit A/D converter and the following conditions: x = 1.6, 
A = 2, and g 0 = 8\ = 2.5 fi -1 . The displayed energy function expresses 
the total energy of the network, thus it involves the integral term (5.38c) 
additive with the truncated energy function (5.35b). Compute the numer- 
ical value of the total energy function at the energy minimum located 
approximately at 

v 0 * = 0.03, v x * = 0.99 

P5.20 Figure P5.20 illustrates two cases of vector helds for the two-bit A/D 
converter. Each of the continuous lines drawn within the helds are for the 
condition dv 0 /dt — 0 and dvjdt = 0. Obviously, stationary points are 
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(a) 


(b) 


Figure P5.20 Vector fields for Problem P5.20. 


located at the intersections of the lines, since both dv 0 /dr = dv x /dt ~ 0 
there. By analyzing the vector field determine the following conditions: 

(a) What are the types of the stationary points displayed in each of the 
figures? 

(b) What is the approximate analog input value for the vector field of 
Figure P5.20(a)? 
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\ / 

I n the preceding chapter we were concemed with dynamical systems that can 
be used in information processing systems. As we have shown, their dynamic 
behavior exhibits stable States that act as attractors during the system’s evolu- 
tion in time. Our discussion of dynamical systems thus far has been primarily 
oriented toward solving optimization problems. In this chapter, we will interpret 
the system’s evolution as a movement of an input pattem toward another stored 
pattem, called a prototype or stored memory. Specifically, we will look at budding 
associations for pattem retrieval and restoration. We will also study the dynamics 
of the discrete-time convergence process. Neural networks of this class are called 
associative memories and are presented in this chapter. 

An efficient associative memory can store a large set of pattems as mem- 
ories. During recall, the memory is excited with a key pattern (also called the 
search argument) containing a portion of information about a particular member 
of a stored pattem set. This particular stored prototype can be recalled through 
association of the key pattem and the information memorized. A number of 
architectures and approaches have been devised in the literature to solve effec- 
tively the problem of both memory recording and retrieval of its content. While 
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most of our discussion in this chapter will involve dynamical systems using fully 
coupled feedback net works from Chapter 5, feedforward memory architectures 
that employ no feedback will also be discussed. Since feedforward memories 
were the first developed, their study seems to be both informative and provides 
insight into the fundamental concepts of neural memories. 

Associative memories belong to a class of neural networks that leams ac- 
cording to a certain recording algorithm. They usually acquire information a 
priori , and their connectivity (weight) matrices most often need to be formed in 
advance. Writing into memory produces changes in the neural interconnections. 
Reading of the stored information from memory, introduced in Chapter 2 as re- 
call, can be considered as the transformation of the input signals by the network. 
No usable addressing scheme exists in an associative memory since all memory 
information is spatially distributed and superimposed throughout the network. 

Let us review the expectations regarding associative memory capabilities. 
The memory should have as large a capacity as possible or a large p value, which 
denotes the number of stored prototypes. At the same time the memory should be 
able to store data in a robust manner, so that local damage to its structure does 
not cause total breakdown and inability to recall. In addition, the ideal memory 
should truly associate or regenerate stored pattem vectors and do so by means 
of specific similarity criteria. Another very desirable feature of memory would 
be its ability to add and eliminate associations as storage requirements change. 

Associative memory usually enables a parallel search within a stored data 
file. The purpose of the search is to output either one or all stored items that 
match the given search argument, and to retrieve it either entirely or partially. It is 
also believed that biological memory operates according to associative memory 
principles. No memory locations have addresses; storage is distributed over a 
large, densely interconnected, ensemble of neurons. What exactly is meant by 
that network of interconnections is seldomly defined for biological systems. The 
operating principles of artificial neural memory models are sometimes also very 
involved, and their presentation in this chapter is by no means exhaustive. The 
intention is to provide an understanding of basic associative memory concepts 
and of the potential benefits, applications, and limitations. 


6.1 — — — 

“ BASIC CONCEPTS 

Figure 6.1 shows a general block diagram of an associative memory per- 
forming an associative mapping of an input vector x into an output vector v. The 
system shown maps vectors x to vectors v, in the pattem space R n and output 
space R m , respectively, by performing the transformation 

v = M[x\ (6.1) 

The operator M denotes a general nonlinear matrix-type operator, and it has 
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Figure 6.1 Block diagram of an associative memory. 


different meaning for each of the memory models. Its form, in faet, defines a 
specific model that will need to be carefully outlined for each type of memory. 
The structure of M reflects a specific neural memory paradigm. For dynamic 
memories, M also involves time variable. Thus, v is available at memory output 
at a later time than the input has been applied. 

For a given memory model, the form of the operator M is usually expressed 
in terms of given prototype vectors that must be stored. The algorithm allowing 
the computation of M is called the recording or storage algorithm. The operator 
also involves the nonlinear mapping performed by the ensemble of neurons. 
Usually, the ensemble of neurons is arranged in one or two layers, sometime 
intertwined with each other. 

The mapping as in Equation (6.1) performed on a key vector x is called a 
retrieval. Retrieval may or may not provide a desired solution prototype, or an 
undesired prototype, but it may not even provide a stored prototype at all. In 
such an extreme case, erroneously recalled output does not belong to the set of 
prototypes. In the following sections we will attempt to define mechanisms and 
conditions for efficient retrieval of prototype vectors. 

Prototype vectors that are stored in memory are denoted with a superscript 
in parenthesis throughout this chapter. As we will see below, the storage algo- 
rithm can be formulated using one or two sets of prototype vectors. The storage 
algorithm depends on whether an autoassociative or a heteroassociative type of 
memory is designed. 

Let us assume that the memory has certain prototype vectors stored in such 
a way that once a key input has been applied, an output produced by the memory 
and associated with the key is the memory response. Assuming that there are p 
stored pairs of associations defined as 

x® V®, for i = 1, 2, . . . , p 


(6.2a) 
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Figure 6.2 Addressing modes for memories: (a) address-addressable memory and (b) content- 
addressable memory. 


and v w =£ x (l) , for i = 1, 2, .... p, the network can be termed as heteroas- 
sociative memory. The association between pairs of two ordered sets of vectors 

{x (1 \ x (2) x (p) } and {v (1) , v (2) , . . . ,v (p) } is thus heteroassociative. An exam- 

ple of heteroassociative mapping would be a retrieval of the missing member of 
the pair (x <0 , v (,) ) in response to the input x (,) or v (l) . If the mapping as in (6.2a) 
reduces to the form 

x (0 v (0 | v(i)=xW , for i = 1, 2, . . . , p (6.2b) 

then the memory is called autoassociative. Autoassociative memory associates 
vectors from within only one set, which is {x (1) ,x (2) , . . . ,x (/?) }. Obviously, the 
mapping of a vector x ^ into itself as suggested in (6.2b) cannot be of any 
significance. A more realistic application of an autoassociative mapping would 
be the recovery of an undistorted prototype vector in response to the distorted 
prototype key vector. Vector x (/) can be regarded in such case as stored data and 
the distorted key serves as a search key or argument. 

Associative memory, which uses neural network concepts, bears very little 
resemblance to digital computer memory. Let us compare their two different 
addressing modes which are commonly used for memory data retrieval. In dig- 
ital computers, data are accessed when their correct addresses in the memory 
are given. As can be seen from Figure 6.2(a), which shows a typical memory 
organization, data have input and output lines, and a word line accesses and 
activates the entire word row of binary cells containing word data bits. This 
activation takes place whenever the binary address is decoded by the address 
decoder. The addressed word can be either “read” or replaced during the “write” 
operation. This is called address-addressable memory. 
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In contrast with this mode of addressing, associative memories are content- 
addressable. The words in this memory are accessed based on the content of 
the key vector. When the network is excited with a portion of the stored data 
x (,) , i = 1, 2, the efficient response of the autoassociative network is 

the complete x (/) vector. In the case of heteroassociative memory, the content of 
vector x (/) should provide the stored response v w . However, there is no storage 
for prototype x (l) or v (z) , for i = 1, 2, at any location within the net- 

work. The entire mapping (6.2) is distributed in the associative network. This 
is symbolically depicted in Figure 6.2(b). The mapping is implemented through 
dense connections, sometimes involving feedback, or a nonlinear thresholding 
operation, or both. 

Associative memory networks come in a variety of models. The most im- 
portant classes of associative memories are static and dynamic memories. The 
taxonomy is based entirely on their recall principles. Static networks recall an 
output response after an input has been applied in one feedforward pass, and, 
theoretically, without delay. They were termed instantaneous in Chapter 2. Dy- 
namic memory networks produce recall as a result of output /input feedback 
interaction, which requires time. Respective block diagrams for both memory 
classes are shown in Figure 6.3. The static networks implement a feedforward 
operation of mapping without a feedback, or recursive update, operation. As 
such they are sometimes also called non-recurrent. Static memory with the block 
diagram shown in Figure 6.3(a) performs the mapping as in Equation (6.1), which 
can be reduced to the form 


v* = Mj[x*] 


(6.3a) 


where k denotes the index of recursion and M x is an operator symbol. Equation 
(6.3a) represents a system of nonlinear algebraic equations. Examples of static 
networks will be discussed in the next section. 

Dynamic memory networks exhibit dynamic evolution in the sense that they 
converge to an equilibrium State according to the recursive formula 


y 


= M 2 |x*,v* 


(6.3b) 


provided the operator M 2 has been suitably chosen. The operator operates at the 
present instant k on the present input x k and output \ k to produce the output in 
the next instant k + 1. Equation (6.3b) represents, therefore, a system of non- 
linear difference equations. The block diagram of a recurrent network is shown 
in Figure 6.3(b). The delay element in the feedback loop inserts a unity delay 
A, which is needed for cyclic operation. Autoassociative memory based on the 
Hopfield model is an example of a recurrent network for which the input x° is 
used to initialize v°, i.e., x° = v°, and the input is then removed. The vector 
retrieved at the instant k can be computed with this initial condition as shown: 


v* +1 = M 2 [\ k ] 


(6.3c) 
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(a) (b) 



(c) 


Figure 6.3 Block diagram representation of associative memories: (a) feedforward network, (b) 
recurrent autoassociative network, and (c) recurrent heteroassociative network. 


Figure 6.3(c) shows the block diagram of a recurrent heteroassociative memory 
that operates with a cycle of 2A. The memory associates pairs of vectors (x w , v (f) ), 
i = 1, 2, . . . , /?, as given in (6.2a). 

Figure 6.4 shows Hopfield autoassociative memory without the initializing 
input x 0 . The figure also provides additional details on how the recurrent mem- 
ory network implements Equation (6.3c). Operator M 2 consists of multiplication 
by a weight matrix followed by the ensemble of nonlinear mapping operations 
v t — f(net t ) performed by the layer of neurons. The details of processing were 
discussed in earlier chapters. 

There is a substantial resemblance of some elements of autoassociative re- 
current networks with feedforward networks discussed in Section 4.5 covering 
the back propagation network architecture. Using the mapping concepts pro- 
posed in (4.30c) and (4.31) we can rewrite expression (6.3c) in the following 
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(c) 


Figure 6.4 Autoassociative recurrent memory: (a) block diagram, (b) expanded block diagram, 
and (c) example State transition map. 


customary form: 


v *+i = p[Wv*] 


(6.4) 


where W is the weight matrix of a single layer as defined throughout Chapter 
4 or 5, The operator F[ • ] is a nonlinear matrix operator with diagonal elements 
that are hard-limiting (binary) activation functions/(*): 


'sgn(-) 0 0 

0 sgn(-) - O 

ri*] . . . 

. 0 0 • • ■ sgn(-) 


(6.5) 
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The expanded block diagram of the memory is shown in Figure 6.4(b). Although 
mappings performed by both feedforward and feedback networks are similar, re- 
current memory networks respond with bipolar binary values, and operate in a 
cyclic, recurrent fashion. Their time-domain behavior and properties will there- 
fore no longer be similar. 

Regarding the vector \(k + 1) as the State of the network at the (k + l)’th 
instant, we can consider recurrent Equation (6.4) as defining a mapping of the 
vector v into itself. The memory State Space consists of 2 n rc-tuple vectors with 
components ±1. The example State transition map for a memory network is 
shown in Figure 6.4(c). Each node of the graph is equivalent to a State and 
has one and only one edge leaving it. If the transitions terminate with a State 
mapping into itself, as is the case of node A, then the equilibrium A is the fixed 
point. If the transitions end in a cycle of States as in nodes B , then we have a 
limit cycle solution with a certain period. The period is defined as the length of 
the cycle. The figure shows the limit cycle B of length three. 

Let us begin with a review of associative memory networks beginning with 
static networks. 


~ LINEAR ASSOCIATOR 

Traditional associative memories are of the feedforward, instantaneous type. 
As defined in (6.2a), the task required for the associative memory is to leam the 
association within p vector pairs for / = 1, 2, . . . , p. For the linear 

associative memory, an input pattem x is presented and mapped to the output by 
simply performing the matrix multiplication operation 

v = Wx (6.6a) 

where x, v, W are matrices of size nXl,mXl, and mXn, respectively. Thus, the 
general nonlinear mapping relationship (6.3a) has been simplified to the linear 
form (6.6a), hence the memory name. The linear associative network diagram 
can be drawn as in Figure 6.5. Only the customary weight matrix W is used to 
perform the mapping. Noticeably, the network does not involve neuron elements, 
since no nonlinear or delay operations are involved in the linear association. If, 
however, the use of neurons, is required for the reason of uniform perspective of 
all neural networks, then the mapping (6.3a) can be rewritten as 

v = M 1 [Wx] (6.6b) 

where M { [-] is a dummy linear matrix operator in the form of the m X m unity 
matrix. This observation can be used to append an output layer of dummy neurons 
with identity activation functions v, = fineti) = net t . The corresponding network 
extension is shown within dashed lines in Figure 6.5. 
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Figure 6.5 Linear associator. 


Let us assume that p associations need to be stored in the linear associator. 

Given are pairs of vectors {s (i) ,f ( ' ) }, for i — 1, 2, .... p, denoting the stored 

memories, called stimuli, and forced responses, respectively. Since this memory 
is strictly unidirectional, these terms are self-explanatory. We thus have for n- 
tuple stimuli and m-tuple response vectors of the i’th pair: 

s® = | sf sf... sjp] , and (6.6c) 

f®=[tf> ff... /*>]' <6.6d) 

In practice, s (l) can be pattems and f (z) can be information about their class 
membership, or their images, or any other pairwise assigned association with in- 
put pattems. The objective of the linear associator is to implement the mapping 
(6.6a) as follows 

f (/) + x\ l = Ws (0 

or, using the mapping symbol 

s (0 f(0 + for i = 1,2, . . . , p (6.7) 

such that the length of the noise term vector denoted as t|* is minimized. In 
general, the solution for this problem aimed at tinding the memory weight matrix 
W is not very straightforward. First of all, matrix W should be found such that 
the Euclidean norm X* ||n* ||, is minimized for a large number of observations 
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of mapping (6.7). This problem is dealt with in the mathematical regression 
analysis and will not be covered here. The general solution of the problem of 
optimal mapping (6.7) can be found in references on early associative memories 
(Kohonen 1977; Kohonen et al. 1981). We will find below, however, partial 
solutions to the general mapping problem expressed by Equation (6.7) that will 
also prove to be useful later in this chapter. 

Below we focus on tinding the matrix W that allows for efficient storage 
of data within the memory. Let us apply the Hebbian leaming rule in an attempt 
to train the linear associator network. The weight update rule for the f th output 
node and / th input node can be expressed as 

w 'ij = w ij +fi s r for i = 1, 2, . . . , m and j = 1, 2, . . . , n (6.8a) 

where / and Sj are the f th and /th components of association vectors f and s 
and Wij denotes the weight value before the update. The reader should note that 
the vectors to be associated, f and s, must be members of the pair. To generalize 
formula (6.8a) so it is valid for a single weight matrix entry update to the case 
of the entire weight matrix update, we can use the outer product formula. We 
then obtain 

W' = W + fs' (6.8b) 

where W denotes the weight matrix before the update. Initializing the weights in 
their unbiased position W 0 = 0, we obtain for the outer product leaming rule: 

W' = f (0 s (,) ' (6.9a) 

Expression (6.9a) describes the first leaming step and involves leaming of the 
f th association among p distinet paired associations. Since there are p pairs to 
be leamed, the superposition of weights can be performed as follows 

W' = £ f (i) s ( ' )( (6.9b) 

1 = 1 

The memory weight matrix W' above has the form of a cross-correlation matrix . 
An alternative notation for W' is provided by the following formula: 

W' = FS r (6.9c) 

where F and S are matrices containing vectors of forced responses and stimuli 
and are defined as follows: 

F=[fd) f(2) ... f<p>] 

S = [s™ s< 2 > ... s (p) ] 

where the column vectors f (l) and s 0) were defined in (6.6c) and (6.6d). The 
resulting cross-correlation matrix W' is of size m X n. Integers n and m denote 
sizes of stimuli and forced responses vectors, respectively, as introduced in (6.6c) 
and (6.6d). We should now check whether or not the weight matrix W provides 
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noise-free mapping as required by expression (6.7). Let us attempt to perform 
an associative recall of the vector f (z) when s (z) is applied as a stimulus. If one 
of the stored vectors, say is now used as key vector at the input, we obtain 
from the retrieval formula [(6.6a) and (6.6b)]: 

V = ( £ f (i) s (i) ^ s 0) (6.10a) 

Expanding the sum of p terms yields 

v _ f(D s (i)f s O) + ... + fO') s O')' s O) + . . . + f(p) s (p)‘ s (j) (6.10b) 

According to the mapping criterion (6.7), the ideal mapping s (;) —* f (j) such that 
no noise term is present would require 


v = f (y) 


(6.10c) 


By inspecting (6.10b) and (6.10c) it can be seen that the ideal mapping can be 
achieved in the case for which 


s O)f s O) = ø ? f or i ^ j 

s U)t s (j) = i 


( 6 . 11 ) 


Thus, the orthonormal set of p input stimuli vectors ensures 

perfect mapping (6.10c). Orthonormality is the condition on the inputs if they 
are to be ideally associated. However, the condition is rather strict and may not 
always hold for the set of stimuli vectors. 

Let us evaluate the retrieval of associations evoked by stimuli that are not 
originally encoded. Consider the consequences of a distortion of pattem 
submitted at the memory input as s (y) so that 

s OT = s 0‘) + (6.12) 

where the distortion term A (y) can be assumed to be statistically independent of 
s (y) , and thus it can be considered as orthogonal to it. Substituting (6.12) into 
formula (6.10a), we obtain for orthonormal vectors originally encoded in the 
memory 

v = f O') s O)' s O) + f(j) s U)‘£U) + £(f0) s 0>) A (y) (6.13a) 

i*i 

Due to the orthonormality condition this further reduces to 

v = f (y) + £(f (0 s (0/ )A (y) (6.13b) 

i *j 

It can be seen that the memory response contains the desired association f (y) and 
an additive component, which is due to the distortion term A (;) . The second term 
in the expression above has the meaning of cross-talk noise and is caused by 
the distortion of the input pattem and is present due to the vector A (y) . The term 
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contains, in parentheses, almost all elements of the memory cross-correlation 
matrix weighted by a distortion term A (;) . Therefore, even in the case of stored 
orthonormal pattems, the cross-talk noise term from all other pattems remains 
additive at the memory output to the originally stored association. We thus see 
that the linear associator provides no means for suppression of the cross-talk noise 
term is of limited use for accurate retrieval of the originally stored association. 

Finally, let us notice an interesting property of the linear associator for 
the case of its autoassociative operation with p distinet n-dimensional prototype 
pattems s (/) . In such a case the network can be called an autocorrelator. Plugging 
f(0 = s (0 in (6.9b) results in the autocorrelation matrix W': 

W' = £ s (/) s (0 ' (6.14a) 

/=i 

This result can also be expressed using the S matrix from (6.9c) as follows 

W' = SS' (6.14b) 

The autocorrelation matrix of an autoassociator is of size n X n. Note that this 

matrix can also be obtained direetly from the Hebbian leaming rule. Let us 
examine the attempted regeneration of a stored pattem in response to a distorted 
pattem s (y) ’ submitted at the input of the linear autocorrelator. Assume again that 
input is expressed by (6.12). The output can be expressed using (6.10b), and it 
simplifies for orthonormal pattems s (y) , for j = 1, 2, to the form 

v = s (y) + X s (l) s (,) 'A (7) (6.15a) 

**j 

This becomes equal 

v = s ( '> + (p — 1)A (;) (6.15b) 

As we can see, the cross-talk noise term again has not been eliminated even for 
stored orthogonal pattems. The retrieved output is the stored pattem plus the dis- 
tortion term amplified p— 1 times. Therefore, linear associative memories perform 
rather poorly when retrieving associations due to distorted stimuli vectors. 

Linear associator and autoassociator networks can also be used when linearly 
independent vectors s (1) , s (2) , . . . , s (p) , are to be stored. The assumption of linear 
independence is weaker than the assumption of orthogonality and it allows for 
consideration of a larger class of vectors to be stored. As discussed by Kohonen 
(1977) and Kohonen et al. (1981), the weight matrix W can be expressed for 
such a case as follows: 

W - F(S'S) _1 S' (6.16) 

The weight matrix found from Equation (6.16) minimizes the squared output 
error between and v (y) in the case of linearly independent vectors (see 
Appendix). Because vectors to be used as stored memories are generally neither 
orthonormal nor linearly independent, the linear associator and autoassociator 
may not be efficient memories for many practical tasks. 
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Linear auto- and heteroassociative memory networks are in a certain limited 
sense similar to dynamic associative memories. In particular, both memories 
have similar recording algorithms. However, as shown, linear associators do not 
provide solutions for the suppression of the noise term in expressions (6.13b) 
and (6.15). We will show that other memory models utilize a thresholding oper- 
ation to suppress the noise component present in an output signal. The recurrent 
autoassociative memory presented in the following section are examples of such 
memory models. 


BASIC CONCEPTS 
(/ OF RECURRENT 

AUTOASSOCIATIVE MEMORY 

Discussion of the linear associative memory has pointed out the need for 
suppressing the output noise at the memory output. This can be done by thresh- 
olding the output and by recycling of the output to input in order to produce an 
improved association. The repetitive process of recycling of the output followed 
by a feedforward pass through the network can be performed by a recurrent neural 
network. Such a network is similar to the linear associative memory, however, it 
has feedback, nonlinear mapping in the form of thresholding, and is dynamical. 

Recurrent autoassociative memory has already been introduced as an exam- 
ple of dynamic associative network in Figure 6.4. The reader should by now 
be familiar with the general theory of such a network. The memory is essen- 
tially a single-layer feedback network covered in Section 5.2 as a discrete-time 
network originally proposed by Hopfield (1982, 1984). In this section, a more 
detailed view of the model and its performance as an associative memory will 
be presented. Our focus is mainly on the dynamical performance of recurrent 
autoassociative memories. In contrast to continuously updating single-layer feed- 
back networks discussed throughout Chapter 5, associative memories are updated 
in discrete time. 

An expanded view of the Hopfield model network from Figure 6.4 is shown 
in Figure 6.6. Figure 6.6(a) depicts Hopfield' s autoassociative memory. Under the 
asynchronous update mode, only one neuron is allowed to compute, or change 
State, at a time, and then all outputs are delayed by a time A produced by the 
unity delay element in the feedback loop. This symbolic delay allows for the 
time-stepping of the retrieval algorithm embedded in the update rule of (5.3) or 
(5.4). Figure 6.6(b) shows a simplified diagram of the network in the form that 
is often found in the technical literature. Note that the time step and the neu- 
rons’ thresholding function have been suppressed on the figure. The computing 
neurons represented in the figure as circular nodes need to- perform summation 
and bipolar thresholding and also need to introduce a unity delay. Note that the 
recurrent autoassociative memories studied in this chapter provide node responses 
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(a) 



Figure 6.6 Hopfield model autoassociative memory (recurrent autoassociative memory): 
(a) expanded view and (b) simplified diagram. 
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of discrete values ±1. The domain of the /z-tuple output vectors in IR" are thus 
vertices of the /z-dimensional cube [—1, 1]. 


Retrieval Algorithm 


Based on the discussion in Section 5.2 the output update rule for Hopfield 
autoassociative memory can be expressed in the form 


v. 


k+i _ 


sgn 


L?."w 


(6.17) 


where k is the index of recursion and i is the number of the neuron currently 
undergoing an update. The update rule (6.17) has been obtained from (5.4a) 
under the simplifying assumption that both the extemal bias i t and threshold 
values 7) are zero for / = 1, 2, . .., n. These assumptions will remain valid 
for the remainder of this chapter. In addition, the asynchronous update sequence 
considered here is random. Thus, assuming that recursion starts at v°, and a 
random sequence of updating neurons m, p, q, . . . is chosen, the output vectors 
obtained are as follows 


First update: 
Second update: 
Third update: 




(6.18) 


Considerable insight into the Hopfield autoassociative memory performance can 
be gained by evaluating its respective energy function. The energy function (5.5) 
for the discussed memory network simplifies to 

£(v) = -Iv'Wv (6.19a) 

We consider the memory network to evolve in a discrete-time mode, for k — 1, 
2, and its outputs are one of the 2 n bipolar binary n - tuple vectors, each 
representing a vertex of the /z-dimensional [— 1,+1] cube. We also discussed 
in Section 5.2 the faet that the asynchronous recurrent update never inereases 
energy (6.19a) computed for v = v\ and that the network setties in one of the 
local energy minima located at cube vertices. 

We can now easily observe that the complement of a stored memory is also 
a stored memory. For the bipolar binary notation the complement vector of v is 
equal to — v. It is easy to see from (6.19a) that 
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and thus both energies £(v) and E(-\) are identical. Therefore, a minimum of 
£(v) is of the same value as a minimum of £(-v). This provides us with an 
important conclusion that the memory transitions may terminate as easily at v as 
at -v. The crucial factor determining the convergence is the “similarity” between 
the initializing output vector, and v and — v. 


Storage Algorithm 

Let us formulate the information storage algorithm for the recurrent autoas- 
sociative memory. Assume that the bipolar binary prototype vectors that need to 
be stored are s (m \ for m = 1, 2, p. The storage algorithm for calculating 
the weight matrix is 

W = £ s (m) s (m) ' - pi (6.20a) 

m=l 

or 

w,j - (i - Si,) i »IT (6.20b) 

where, as before, Sy denotes the usual Kronecker function Sy = 1 if i = j, and 
S ? y = 0 if i # j. The weight matrix W is very similar to the autocorrelation 
matrix obtained using Hebb’s leaming rule for the linear associator introduced 
in (6.14). The difference is that now w H = 0. Note that the system does not 
remember the individual vectors s^ m) but only the weights w ( p which basically 
represent correlation terms among the vector entries. 

Also, the original Hebb’s leaming rule does not involve the presence of 
negative synaptic weight values, which can appear as a result of leaming as in 

(6.20) . This is a direct consequence of the condition that only bipolar binary 
vectors s (m) are allowed for building the autocorrelation matrix in (6.20). Inter- 
estingly, additional autoassociations can be added at any time to the existing 
memory by superimposing new, incremental weight matrices. Autoassociations 
can also be removed by respective weight matrix subtraction. The storage rule 

(6.20) is also invariant with respect to the sequence of storing pattems. 

The information storage algorithm for unipolar binary vectors s (m) , for m = 
1, 2, . . . , p 9 needs to be modified so that a -1 component of the vectors simply 
replaces the 0 element in the original unipolar vector. This can be formally 
done by replacing the entries of the original unipolar vector s w with the entries 
2s- w) — 1, i = 1, 2, . . . , n. The memory storage algorithm (6.20b) for the unipolar 
binary vectors thus involves scaling and shifting and takes the form 

Wij = (1 - Sy) £ (2s- m) - l)(2sj m) - 1) (6.21) 

m= 1 

Notice that the information storage rule is invariant under the binary complement 
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(c) (d) 

Figure 6.7 Example of an asynchronous update of a corrupted negative image of a bit map of 
digit 4: (a) key vector, (b) after first cycle, (c) after second cycle, and d) after third cycle. 


operation. Indeed, storing complementary pattems s (m) instead of original pattems 
s (m) results in the weights as follows: 

w'ij = (1 - Sij) £ ( 2 sf m) - 1 )( 2 sf m) - 1 ) ( 6 . 22 ) 

771—1 

The reader can easily verify that substituting 

sf m) = 1 - (6.23) 

into (6.22) results in w\j — w^. Figure 6.7 shows four example convergence 
steps for an associative memory consisting of 120 neurons with a stored binary 
bit map of digit 4. Retrieval of a stored pattem initialized as shown in Figure 
6.7(a) terminates after three cycles of convergence as illustrated in Figure 6.7(d). 
It can be seen that the recall has resulted in the true complement of the bit 
map originally stored. The reader may notice similarities between Figures 5.2 
and 6.7. 
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EXAMPLE 6.1 """“i"™"™"" 

In this example we will discuss the simplest recurrent associative memory 
network known, which at the same time, seems to be the most commonly 
used electronic network in the world. Although the network has been popular 
for more than half a century, we will take a fresh look at it and investigate 
its performance as of a neural memory. The bistable flip-flop is the focus of 
our discussion. In faet, most digital computers use this element as a basic 
memory cell and as a basic computing device. 

The flip-flop network is supposed to find and indefinitely remain in one 
of the two equilibrium States: s (1) = [ 1 — 1 ] l or s (2) = [ — 1 1 ]\ 

Before we design a bistable flip-flop using the associative memory 
concept, let us review its basic physical properties. The network as shown 
in Figure 6.8(a) consists of two cascaded inverting neurons. Consider the 
feedback loop switch K to be open initially. Each of the neurons represented 
by amplifiers with saturation is characterized by the two identical voltage 
transfer characteristics as shown in Figures 6.8(b) and (c). 

Resistances R x and R 2 have to be positive, but their values are of no 
significance since the ideal neurons absorb no input current. By connecting 
two neurons in Cascade, the output v 2 is a funetion of v 1? and v x is a funetion 
of v 0 . Thus, v 2 can be expressed as a composite funetion of v 0 . The Cascade 
of two amplifiers has a composed characteristic v^v^Vq)] as illustrated in 
Figure 6.8(d). When the switch K closes and enforces the condition v 2 — v 0 
represented by the unity-slope straight line on the figure, the circuit imme- 
diately starts evolving toward one of the equilibrium points. Note that the 
three potential equilibrium points are A, B , and S. 

The origin S of the coordinates here is an unstable equilibrium point. If 
any perturbation arises in either direction with the network momentarily at 
S, it is going to be magnified by the positive feedback within the network so 
the transition will finally end in either A or B. Thus, the physical memory 
network will never stabilize at 5, which is supposedly the saddle point of 
its energy funetion. Note that the equilibrium States for high-gain amplifiers 
are, for points A and £, respectively: 


V 

__ 

f 


V ll 

-r 

. v 2. 


-1 


- v 2j 

i 


The dynamics of the equilibrium search, which is essential for the network’s 
physical operation during the transition period, has been suppressed in this 
description for simplicity. The issue of dynamics was extensively discussed 
in Chapter 5. A trace of capacitance and neuron input conductance is, in faet, 
indispensable for this circuit to respond. The associative memory discussed 
here performs the input /output mapping, and discrete-time formalism is 
used only to characterize it. The physical output update process, however, 
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V 2 



Figure 6.8 Bistable flip-flop: (a) Circuit diagram, (b) voltage transfer characteristic of ampli- 
fier 1, (c) voltage transfer characteristic of amplifier 2, and (d) joint characteristics and 
equilibrium States. 


follows the continuous-time behavior of the model, which uses the contin- 
uous activation function of the neurons. 

Knowing now the basic physical principles of the network, we can 
approach the problem of a bistable neural memory design with better un- 
derstanding. Since the leaming (storage) rule is invariant under the binary 
complement operation, it is sufficient to store either of the two pattems s (1) 
and s (2) only. From (6.20a) we obtain the weight matrix 


W = 


0 -1 
-1 o 


(6.25) 


This results in the bistable flip-flop Circuit shown in Figure 6.9(a) with 
weights of w x2 = tv 21 = — lft. The Circuit has negative resistances, which 
are difficult to implement. It has, however, noninverting neurons. By trans- 
forming noninverting neurons into inverting neurons and by changing the 
signs of the resistances, an equivalent Circuit is obtained as shown in Figure 
6.9(b). Although the equivalency is valid strictly for a hard-limiting acti- 
vation function and zero input conductance of the neurons, both circuits of 
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Figure 6.9 Bistable flip-flop as an associative memory: (a) Circuit diagram and (b) realizable 
Circuit diagram. 


Figure 6.9 can be considered equivalent for all practical purposes. As we 
can see, the memory circuit obtained is identical to the basic flip-flop circuit 
configuration introduced earlier in Figure 6.8(a). 

To complete our study of the network, let us consider its energy func- 
tion. Substituting (6.25) into (6.19), the energy function of this memory 
network is obtained as 


E = VjV 2 (6.26) 

The energy function is shown in Figure 6.10(a). It consists of two surfaces 
monotonically decreasing toward s (l) and s (2) , where s (1) = [l — 1] / and 
s ,2) — [ — 1 1 j ‘ . The ridge of the function is at v, = v 2 . It can easily be 

seen that for the initial condition v° < v°, the update process will result in 
stable output pattem s (2) . For > v®, the network will stabilize at output 
s (1) . The figure shows that this network has a virtually perfect recall ability 
to the doser of the two memories stored. These properties refer to the up- 
date by the continuous-time network employing neurons with the continuous 
activation function. 

Let us inspect the performance of the memory operating in the discrete- 
time mode. The memory cell now employs two bipolar binary neurons. In 
Section 5.2 we considered the recurrent memory network with the weight 
matrix as in (6.25). We then showed that the synchronous transitions of the 
network were oscillations between [ - 1 — 1 ] 1 and [ 1 1 ] \ provided the 

updates originated at one of these outputs. The transition map for this case 
is depicted in Figure 6.10(b). It can be seen from the map that the energy 
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Figure 6.10 Energy function and transition maps for the bistable flip-flop: (a) energy surface, 
(b) transition map for synchronous update, and (c) transition map for asynchronous update. 


minimization property does not hold for the synchronous update mode since 
there are lower energy levels that cannot be reached. This is the case for 
oscillatory updates between the energy levels of value 1 never reaching the 
lower value of —1. 


334 


ASSOCIATIVE MEMORIES 


Consider now the asynchronous updates for this circuit. Starting at each 
of the first four initial vectors v° we obtain in a single update step: 

L For v° = [ -1 ~l]* and the first neuron allowed to update 

v 1 = r f ? “il f“!l = [ il = S (1 > 


0 

-1 


'-f 


r 

-1 

0 


-1 


-i 


Energy has dropped from 1 to —1 as a result of the update and the 
network has settled at s (1) . Starting again at v° as above but with the 
second neuron allowed to update, we obtain 


v 


i 


0 

-r 

-r 


-r 

-i 

0. 

-i 


i 


= s 


( 2 ) 


and we have the same energy drop from 1 to —1 while the network 
settled at s (2) . 

2. For v° = [ 1 1 Y and the first neuron allowed to update 



Energy has dropped from 1 to — 1 as a result of this update. Allowing 
the second neuron to update, we obtain 



and we have the same energy drop from 1 to — 1. 

3. For v° = [ —1 1 y we obtain 


v 1 = r 


0 

-i 


-r 


-r 

-i 

0 


i 


i 


= s (2) 


independent of the update sequence. The energy value remains at the 
lo west level of —1. 

4. For v° = [ 1 — 1 ] 1 we obtain 


v 1 =r 


0 

-r 


r 


r 

-i 

0 


-i 


-i 


- s 


( 1 ) 


independent of the update sequence. As in step 3, energy remains at 
the lo west level of —1. 


The summary of the transitions in the asynchronous mode for this simple 
associative memory is illustrated in Figure 6.10(c). The figure shows the 
discussed transitions between the four vertices of the graph. Case numbers 
correspond to initial node numbers which are circled on the figure. må 
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Because of the network size, the presented example has been more ed- 
ucational than of any practical value. The practical importance of recurrent 
associative memory is for larger size networks. In faet, many of its attractive 
properties related to recall and noise suppression are valid under statistical inde- 
pendence of stored vectors and for large n values. As stated before, the network 
is able to produce correct stored States when an incomplete or noisy version 
of the stored vector is applied at the input and after the network is allowed to 
update its output asynchronously. 

m Summary of the Recurrent Associative Memory Storage and Retrieval Algorithm 
(RAMSRA) . 

Given are p bipolar binary vectors: 

|s (1 \s (2 \ . . . j , where s (m) is ( n X 1), for m = 1, 2, . . . , p 

Initializing vector v° is (n X 1) 

Storage 

Step 1 : Weight matrix W is (n X n ): 

W<-0, m <— l 

Step 2: Vector s (m) is stored: 

W _ j 

Step 3: If m < p then m m + 1, and go to Step 2, otherwise, go 
to Step 4. 

Step 4: Recording of vectors is completed. Output weights W. 

Recall 

Step 1 : Cycle counter k is initialized, k 1. Update counter i within 
cycle is initialized, i 1 and the network is initialized, v <— v°. 

Step 2: Integers 1, 2, n, are arranged in an ordered random 

sequence a 1? a 2 , . . . , a n for this cycle. (See note at end of list.) 

Step 3: Neuron i is updated by computing vnew a . 

net a . = I w a .j Vj 
J= l 

vnew a . = sgn (net a .) 

Step 4: If i < n , then i <— i + 1, and go to Step 3; otherwise, go to 
Step 5. 
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Step 5: If vnew a . = v a ., for i = 1, 2, . . . , n, then no updates occur 
in this cycle; thus, recall is complete; output k and vnew x , vnew 2 , 
. . . , vnew n \ otherwise k*—k+ 1 and go to Step 2. 

NOTE: Recall is much simpler to implement without random- 

izing the update sequence. In such case, Step 2 of the algorithm 
is eliminated and we have a x = 1, a 2 = 2, . . . , a n = n, or simply 
a t = i at each update cycle. Elimination of Step 2 can, however, 
result in reduced efficiency of retrieval. 


f Performance Considerations 

Hopfield autoassociative memory is often referred to in the literature as an 
error correcting decoder in that, given an input vector that is equal to the stored 
memory plus random errors, it produces as output the original memory that is 
closest to the input. The reason why the update rule proposed by Hopfield can 
reconstruct a noise-corrupted or incomplete pattem can be understood intuitively. 
The memory works best for large n values and this is our assumption for further 
discussion of memory ’s performance evaluation. Let us assume that a pattem 
s (m '> has been stored in the memory as one of p pattems. This pattem is now 
at the memory input. The activation value of the fth neuron for the update mle 
(6.17) for retrieval of pattem s (w ^ has the following form: 


net^ = E wJf') 

7=1 

(6.27a) 

or, using (6.20b) and temporarily neglecting the eontribution coming from the 
nullification of the diagonal, we obtain 

net^ se i £ 

j=\m=\ J 

(6.27b) 

Thus, we can write 


netf') = £ sf) X sfsf") 
m= 1 7=1 J 

(6.27c) 


If terms Sj m) and sf 1 \ for j — 1,2 were totally statistically independent or 
unrelated for m = 1 , 2, . . . , p, then the average value of the second sum resulted 
in zero. Note that the second sum is the scalar product of two n-tuple vectors and 
if the two vectors are statistically independent (also when orthogonal) their prod- 
uct vanishes. If, however, any of the stored pattems s (m) , for m — 1, 2, . . . , /?, 
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and vector s 1 "'' 1 are somewhat overlapping, then the value of the second sum 
becomes positive. Note that in the limit case the second sum would reach n for 
both vectors being identical, understandably so since we have here the scalar 
product of two identical n-tuple vectors with entries of value ±1. Thus for the 
major overlap case, the sign of entry sf" ) is expected to be the same as that of 
net\ m \ and we can write 

netf } = s\ m,) n, for i = 1,2,...,« (6.27d) 

This indicates that the vector s (m * does not produce any updates and is therefore 
stable. 

Assume now that the input vector is a distorted version of the prototype 
vector s (m '\ which has been stored in the memory. The distortion is such that 
only a small percentage of bits differs between the stored memory s (m) and the 
initializing input vector. The discussion that formerly led to the simplification 
of (6.27c) to (6.27d) still remains valid for this present case with the additional 
qualification that the multiplier originally equal to n in (6.27d) may take a some- 
what reduced value. The multiplier becomes equal to the number of overlapping 
bits of s (m) and of the input vector. 

It thus follows that the impending update of node i will be in the same 
direction as the entry s- m \ Negative and positive bits of vector s 1 '"' 1 are likely to 
cause negative and positive transitions, respectively, in the upcoming recurrences. 
We may say that the majority of memory initializing bits is assumed to be correct 
and allowed to take a vote for the minority of bits. The minority bits do not 
prevail, so they are flipped, one by one and thus asynchronously, according to 
the will of the majority. This shows vividly how bits of the input vector can 
be updated in the right direction toward the closest prototype stored. The above 
discussion has assumed large n values, so it has been more relevant for real-life 
application networks. 

A very interesting case can be observed for the stored orthogonal pattems 
s (m) . The activation vector net can be computed as 

net = ^ S s w s» - s< m '> (6.28a) 

The orthogonality condition, which is s w 's (7) = 0, for i # j, and = n, for 

i = j, makes it possible to simplify (6.28a) to the following form 

net = (n- p) s (m) (6.28b) 

Assuming that under normal operating conditions the inequality n > p holds, the 
network will be in equilibrium at State s (m ) . Indeed, computing the value of the 
energy function (6.19) for the storage rule (6.20b) we obtain 

E(\) = — ^v' ( X v + ^v'plv 

2 \m = 1 / 2 


(6.29a) 
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For every stored vector s tm ) which is orthogonal to all other vectors the energy 
value (6.29a) reduces to 

E = -i £ (s (m,) V m) ) (s (m), s (m,) ) + is (m ' ), pls (m ' ) (6.29b) 
and further to 

£(s (m,) ) = -i(n 2 -pn) (6.29c) 

The memory network is thus in an equilibrium State at every stored prototype 
vector s (m '>, and the energy assumes its minimum value expressed in (6.29c). 

Considering the simplest autoassociative memory with two neurons and a 
single stored vector (n = 2, p = 1), Equation (6.29c) yields the energy minimum 
of value -1. Indeed, the energy function (6.26) for the memory network of 
Example 6.1 has been evaluated and found to have minima of that value. 

For the more general case, however, when stored pattems s (l) , s (2) , . . . , s (p) 
are not mutually orthogonal, the energy function (6.29b) does not necessarily 
assume a minimum at s (m ) , nor is the vector s (m ) always an equilibrium for the 
memory. To gain better insight into memory performance let us calculate the 
activation vector net in a more general case using expression (6.28a) without an 
assumption of orthogonality: 

net = ns (m ) — ps (m ) + £ (s < " ,, s ( '” ) ')s (m ' , (6.30a) 

This resulting activation vector can be viewed as consisting of an equilibrium 
State term ( n — p)s <m ) similar to (6.28b). In this case discussed before, either full 
statistical independence or orthogonality of the stored vectors was assumed. If 
none of these assumptions is valid, then the sum term in (6.30a) is also present 
in addition to the equilibrium term. The sum term can be viewed as a “noise” 
term vector r| which is computed as follows 

H = (W - s (m V m ' )r + i) s (m,) (6.30b) 

Expression (6.30b) allows for comparison of the noise terms relative to the 
equilibrium term at the input to each neuron. When the magnitude of the /’th 
component of the noise vector is larger than (n - p)s ( f n } and the term has the 
opposite sign, then s\ m ) will not be the network’s equilibrium. The noise term 
obviously increases for an increased number of stored pattems, and also becomes 
relatively significant when the factor (n — p) decreases. 

As we can see from the preliminary study, the analysis of stable States of 
memory can become involved. In addition, firm conclusions are hard to derive 
unless statistical methods of memory evaluation are employed. The next section 
will focus on the performance analysis of autoassociative recurrent memories. 
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6.4 — 

PERFORMANCE ANALYSIS 
OF RECURRENT 
AUTOASSOCIATIVE MEMORY 

In this section relationships will be presented that relate the size of the 
memory n to the number of distinet pattems that can be efficiently recovered. 
These also depend on the degree of similarity that the initializing key vector has 
to the closest stored vector and on the similarity between the stored pattems. We 
will look at example performance and capacity, as well as the fixed points of 
associative memories. 

As mentioned, associative memories recall pattems that display a degree of 
“similarity” to the search argument. To measure this “similarity” precisely, the 
quantity called the Hamming distance (HD) is often used. Strictly speaking, the 
Hamming distance is proportional to the dissimilarity of vectors. It is defined 
as an integer equal to the number of bit positions differing between two binary 
vectors of the same length. 

For two tf-tuple bipolar binary vectors x and y, the Hamming distance is 
equal: 

HD(x,y)= \i tø- ;y,| (6.31) 

Obviously, the maximum HD value between any vectors is n and is the distance 
between a vector and its complement. Let us also notice that the asynchronous 
update allows for updating of the output vector by HD = 1 at a time. The follow- 
ing example depicts some of the typical occurrences within the autoassociative 
memory and focuses on memory State transitions. 


EXAMPLE 6.2 

Let us look at the design and convergence properties of an autoassociative 
memory that stores the following two vectors s (1 \ s (2) : 

s (1) =[-l -1 1 1]' 

s (2) =[-l 1 1 -1]' 


The weight matrix computed according to (6.20) results: 


r o 



o 


o 

o 

o 

-2 


-2 

0 

0 
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0 ' 

-2 

0 

o 


EH 


Mil 


I wkåk 


The energy function computed from (6.19) is readily obtainable as 

E(\) = 2(vj V 3 + v 2 v 4 ) 

When the network is allowed to update, it evolves through certain States, 
which are shown on the State transition diagram in Figure 6.11. States are 
vertices of the four-dimensional cube. The figure shows all possible asyn- 
chronous transitions and their directions. Only transitions between States 
have been marked on the graph; the self-sustaining or stable State transi- 
tions have been omitted for clarity of the picture. Since only asynchronous 
transitions are allowed for autoassociative recurrent memory, every vertex is 
connected by an edge only to the neighboring vertex differing by a single bit. 
Thus, there are only four transition paths connected to every State. The paths 
indicate directions of possible transitions. As explained earlier, transitions 
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for the update rule (6.17) are toward lower energy values. The energy values 
for each State have been marked in parentheses at each vertex. 

Let us review several sample transitions. Starting at v° = [ — 1 1 — 1 1 ]', 
and with nodes updating asynchronously in ascending order, we have the 
State sequence 

v 1 - [1 1 -1 1]' 

v 2 =[l -1 -1 1]' 

v 3 = V 4 = ... = [1 -1 -1 1]' 

as it has been marked on the State diagram. The actual convergence is 
toward a negative image of the stored pattem s (2) . This result should have 
been expected. The initial pattem had no particular similarity to any of the 
stored pattems and apparently did not fali into any closeness category of 
pattems discussed in the previous section. In faet, the selected initializing 
pattem v° has an HD value of 2 to each of the stored pattems. As such, the 
HD value is equal to half of the input vector dimension and the memory 
does not recover any of the stored vectors s (l) or s (2) . 

Let us look at a regeneration of a key vector v° = [ — 1 1 1 1 ] ‘ 

which is at HD = 1 to both s (l> and s (2) . The transition starting in ascending 
order leads to s (1) in single step 

v 1 = v° 

v 2 — [ 1 -1 1 1]' 

v 3 = v 2 
v = v , etc. 

The reader can verify that by carrying out the transitions initialized at v°, 
but carried out in descending node order, pattem s (2) can be recovered. mm 

The reader has certainly noticed that the example autoassociative memory 
just discussed has a number of problems, the most serious of them being uncer- 
tain recovery. Indeed, the memory in this example has been heavily overloaded 
since p/n is 50%. Overloaded memory by design does not provide error-free or 
efficient recovery of stored pattems. Another problem encountered in the memory 
is its unplanned stable States. Evaluation of similar examples of slightly larger 
memories would make it possible to find out that stable States exist that are not 
the stored memories. Such purposely not encoded but physically existent States of 
equilibrium are called spurious memories. Spurious memories are caused by the 
minima of the energy funetion that are additional to the ones we want. We could 
also find examples of convergence that are not toward the closest memory as 
measured with the HD value (see Problem P6.10). These are further drawbacks 
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of the original model since the memories may not all be fixed points. These 
issues become particularly important near or above the memory capacity. 


Energy Function Reduction 

The energy function (6.19) of the autoassociative memory decreases during 
the memory recall phase. The dynamic updating process continues until a local 
energy minimum is found. Similar to continuous-time systems, the energy is 
minimized along the following gradient vector direction: 

V v £(v) = -Wv (6.32a) 

As we will see below, the gradient (6.32a) is a linear function of the Hamming 
distance between v and each of the p stored memories (Petsche 1988). By sub- 
stituting (6.20a) into the gradient expression (6.32a), it can be rearranged to the 
form 

V v E(v) = - £ s (m) 

m= I 

where the scalar product s (m)e v has been replaced by the expression in brackets 
(see Appendix). The components of the gradient vector, V n -E(v), can be obtained 
directly from (6.32b) as 

= -n £ s\ m) + 2 £ •s- m) HD(s (m \ v) + pv t (6.32c) 

dv ; - m=l m = i 

Expression (6.32c) makes it possible to explain why it is difficult to recover 
pattems v at a large Hamming distance from any of the stored pattems s^ m \ m = 
1, 2, ...,/?. 

When bit i of the output vector, v,-, is erroneous and equals - 1 and needs 
to be corrected to +1, the fth component of the energy gradient vector (6.32c) 
must be negative. This condition enables appropriate bit update while the energy 
function value would be reduced in this step. From (6.32c) we can notice, how- 
ever, that any gradient component of the energy function is linearly dependent 
on HD (s (w) , v), for m = 1,2,...,/?. The larger the HD value, the more difficult 
it is to ascertain that the gradient component indeed remains negative due to the 
large potential contribution of the second sum term to the right side of expression 
(6.32c). Similar arguments against large HD values apply for correct update of 
bit V, = 1 toward —1 which requires positive gradient component dE(\)/dv i . 

Let us characterize the local energy minimum v* using the energy gra- 
dient component. For autoassociative memory discussed, v* constitutes a local 
minimum of the energy function if and only if the condition holds that 
v f **(d£/dv;)| v * < 0 for all i = 1, 2, . . . , n. The energy function as in (6.19) can 


n - 2HD (s (m \ v) ] + p\ (6.32b) 
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Figure 6.12 Three possible energy function cross sections along the v, axis: (a) positive slope, 
minimum at -1, (b) negative slope, minimum at 1, and (c) zero slope, no unique minimum. 


be expressed as 

E(v) = -v ; i WijVj - \ i 2 W ’ kj v k Vj (6.33a) 

jM j= 1 

j±i 

where the first term of (6.33a) is linear in v,- and the second term is constant. 
Therefore, the slope of E(v,) is a constant that is positive, negative, or zero. This 
implies that one of the three conditions applies at the minimum v* 

v,* = -1 (6.33b) 

v ; * = 1 (6.33c) 

V,-* = ±1 (6.33d) 

The three possible cases are illustrated in Figure 6.12. The energy function is 
minimized for v,-* = — 1 (case a) or for v,-* = 1 (case b). Zero slope of the en- 
ergy, or gradient component equal to zero (case c), implies no unique minimum 
at either +1 or -1. 


dE 

dVj 

dE 

av,- 

dE 

av,- 


> 0 , 


< 0 , 


o. 


Capacity of Autoassociative 
Recurrent Memory 

One of the most important performance parameters of an associative mem- 
ory is its capacity. Detailed studies of memory capacity have been reported in 
by McEliece et al. (1987) and Komlos and Paturi (1988). A State vector of the 
memory is considered to be stable if v * +1 = TfWv*] provided that v * +1 = v*. 
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Note that the definition of stability is not affected by synchronous versus asyn- 
chronous transition mode; rather, the stability concept is independent from the 
transition mode. 

A useful measure for memory capacity evaluation is the radius of attr action 
p, which is defined in terms of the distance pn from a stable State v such that 
every vector within the distance pn eventually reaches the stable State v. It is 
understood that the distance pn is convenient if measured as a Hamming distance 
and therefore is of integer value. For the reasons explained earlier in the chapter 
the radius of attraction for an autoassociative memory is somewhere between 
1 /n and 1/2, which corresponds to the distance of attraction between 1 and n/ 2. 

For the system to function as a memory, we require that every stored mem- 
ory be stable. Somewhat less restrictive is the assumption that there is at 

least a stable State at a small distance en from the stored memory where s is 
a positive number. In such a case it is then still reasonable to expect that the 
memory has an error correction capability. For example, when recovering the 
input key vector at a distance pn from stored memory, the stable State will be 
found at a distance en from it. Note that this may still be an acceptable output 
in situations when the system has leamed too many vectors and the memory of 
each single vector is faded. Obviously, when e = 0, the stored memory is stable 
within a radius of p. 

The discussion above indicates that the error correction capability of an 
autoassociative memory can only be evaluated if stored vectors are not too close 
to each other. Therefore, each of the p distinet stored vectors used for a capacity 
study are usually selected at random. The asymptotic capacity of an autoassocia- 
tive memory consisting of n neurons has been estimated in by McEliece et al. 
(1987) as 


= (1 - 2 pfn 
4 ln n 


(6.34a) 


When the number of stored pattems p is below the capacity c expressed as in 
(6.34a), then all of the stored memories, with probability near 1, will be stable. 
The formula determines the number of key vectors at a radius p from the stored 
memory that are correctly recallable to one of the stable, stored memories. The 
simple stability of the stored memories, with probability near 1, is ensured by 
the upper bound on the number p given as 


n 

41n n 


(6.34b) 


For any radius between 0 and 1 / 2 of key vectors to the stored memory, almost 
all of the c stored memories are attractive when c is bounded as in (6.34b). If a 
small fraction of the stored memories can be tolerated as unrecoverable, and not 
stable, then the capacity boundary c can be considered twice as large compared 
to c computed from (6.34b). In summary, it is appropriate to State that regardless 
of the radius of attraction 0 < p < 1 / 2 the capacity of the Hopfield memory is 
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bounded as follows 

n n v 

— — < c < — — (6.34c) 

4 ln « 2 ln « 

To offer a numerical example, the boundary values for a 100-neuron network 
computed from (6.34c) are about 5.4, with 10.8 memory vectors. 

Assume that the number of stored pattems p is kept at the level an, for 
0 < a < 1, and n is large. It has been shown that the memory still functions 
efficiently at capacity levels exceeding those stated in (6.34c) (Amit, Gutfreund, 
and Sompolinsky 1985). When a = 0.14, stable States are found that are very 
close to the stored memories at a distance 0.03«. As a decreases to zero, this 
distance decreases as exp(-(l /2)a). Hence, the memory retrieval is mostly ac- 
curate for p < 0.14«. A small percentage of error must be tolerated though if 
the memory operates at these upper capacity levels. 

The study by McEliece et al. (1987) also reveals the presence of spurious 
fixed points, which are not stored memories. They tend to have rather small basins 
of attraction compared to the stored memories. Therefore, updates terminate in 
them if they start in their vicinity. 

Although the number of distinet pattem vectors that can be stored and per- 
feetly recalled in Hopfield’s memory is not large, the network has found a number 
of practical applications. However, it is somewhat peculiar that the network can 
recover only c memories out of the total of 2 n States available in the network as 
the cube comers of «-dimensional hypercube. 


Memory Convergence versus Corruption 

To supplement the study of the original Hopfield autoassociative memory, 
it is worthwhile to look at the actual performance of an example memory. Of 
particular interest are the convergence rates versus memory parameters discussed 
earlier. Let us inspect the memory performance analysis curves shown in Figure 
6.13 (Desai 1990). The memory performance on this figure has been evaluated 
for a network with « = 120 neurons. As pointed out earlier in this section, 
the total number of stored pattems, their mutual Hamming distance and their 
Hamming distance to the key vector determine the success of recovery. Figure 
6.13(a) shows the percentage of correct convergence as a funetion of key vector 
corruption compared to the stored memories. Computation shown is for a fixed 
HD between the vectors stored of value 45. It can be seen that the correct 
convergence rate drops about linearly with the amount of corruption of the key 
vector. The correct convergence rate also reduces as the number of stored pattems 
inereases for a fixed distortion value of input key vectors. The network performs 
very well at p = 2 pattems stored but recovers rather poorly distorted vectors at 
p — 16 pattems stored. 
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Convergence vs. Corruption 
NetWork Type: Hopfield Memory 
NetWork Parameters: Dimension — n (120) 

Threshold — Hard Limited 
@ Hamming Distance HD = 45 
Curve Parameters: Pattems P 
Features: 20 samples per point. 

Comments: As can be seen from the curves the performance is of good quality for corrup- 
tion levels up to 25% with a capacity of 0.04 X n only. The noise tolerance 
becomes poor as the number of pattems approaches the capacity of 18. 


(a) 

Figure 6.13a Memory convergence versus corruption of key vector: (a) for a different number 
of stored vectors, HD = 45. 


Figure 6.13(b) shows the percentage of correct convergence events as a 
function of key vector corruption for a fixed number of stored pattems equal 
to four. The HD between the stored memories is a parameter for the family 
of curves shown on the figure. The network exhibits high noise immunity for 
large and very large Hamming distances between the stored vectors. A gradual 
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% Convergence vs % Corruption 



% Corruption 

O HD=30 A HD =45 OHD=60 X HD=90 


Convergence vs. Corruption 

NetWork Type: Hopfield Memory 

NetWork Parameters: Dimension — n (120) 

Threshold — Hard Limited 
Pattems p — 4 

Curve Parameters: Hamming Distance 

Features: 20 samples per point. 

Comments: This network shows excellent performance and is extremely insensitive to noise 
for corruption levels as high as 35% at a Hamming Distance of 60 between 
the stored prototypes. An abrupt degradation in performance is observed for 
prototypes having more than three quarter of their bits in common. 


(b) 

Figure 6.13b Memory convergence versus corruption of key vector (continued): (b) for different 
HD values, four vectors stored. 


degradation of initially excellent recovery can be seen as stored vectors become 
more overlapping. For stored vectors that have 75% of the bits in common, the 
recovery of correct memories is shown to be rather inefficient. 

To determine how long it takes for the memory to suppress errors, the 
number of update cycles has also been evaluated for example recurrences for 
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% Corruption 

mmp = 4 es 3?=8 a P = 12 


Iterations vs. Corruption 

NetWork Type: Hopfield Memory 

NetWork Parameters: Dimension — n (120) 

Threshold — Hard Limited 
@ Hamming Distance HD = 45 
Curve Parameters: Number of Iterations 
Features: 20 samples per point. 

Comments: The number of iterations during retrieval is fairly low for corruption levels 

below 20%. It increases roughly in proportion to the number of pattems stored. 


(c) 


Figure 6.13c Memory convergence versus corruption of key vector (continued): (c) the number 
of sweep cycies for different corruption levels. 


the discussed memory example. The update cycle is understood as a full sweep 
through all of the n neuron outputs. The average number of measured update 
cycies has been between 1 and 4 as illustrated in Figure 6.13(c). This number 
increases roughly linearly with the number of pattems stored and with the percent 
corruption of the key input vector. 
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Fixed Point Concept 

We now look in more detail at the convergence of recurrent autoassociative 
networks. Since the energy function E(\) of Equation (6.19) is bounded from 
below, the network evolves under the asynchronous dynamics toward E(v* +1 ) 
such that 

£(v* +1 ) <£(v*) (6.35a) 

Since vectors v^ +1 and v* differ during the memory transitions by at most a 
single component (bit), the stabilization of E(y k+l ) means that 

E(\ k+l ) = E(y k ) for k > k 0 (6.35b) 

The transitions stop at the energy minimum, which also implies that 

v* +1 = v* (6.35c) 

Thus the network reaches its fixed point v*, or stable State, at the energy mini- 
mum E(\ k ). We also know from the energy function study that the only attractors 
of the discussed network are its fixed points. Limit cycles such as B shown in 
Figure 6.4(c) are thus excluded in these type of networks. 

It is instructive to devote more attention to the fixed-point concept, because 
it sheds more light on memory performance. Let us begin with a discussion of 
the one-dimensional recurrent system shown in Figure 6.14. For the output at 
t = k + 1 we have 

v* +1 =f(yv,y k ) (6.36a) 

The fixed point is defined at v* if the following relationship holds: 

v* = /(v*) (6.36b) 

where it has been assumed that network parameter w in (6.36a) is constant. In 
geometrical terms the fixed point is found at the intersection of function /(v) 
and v. Whether or not this point is the solution of recursive Equation (6.36a) 
remains, however, an open question. In terms of the recursive formula (6.36a), 
the fixed point is said to be stable if 

lim v* = v*, or (6.37a) 

k — ►« 

lim e k+ ' = 0 (6.37b) 


V* * 

f(w, V k ) 



1 







Figure 6.14 One-dimensional recurrent network. 
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where e is the recursion error defined as 

e k+\ 4 | v *+i _ v *| 

The recursion error can be expressed using (6.36a) as 

e * +1 = \f(v k ) ~ v*| 
and further rearranged to the form 

e k + 1 _ / [v* + (v* - V*)] -/(V*) 

yk y* 

For a differentiable function / responsible for network’s feedforward processing, 
this further reduces to 

e k+ ' s |/'(v*) | e k (6.38c) 

Now the condition (6.38c) translates to the form 

|/'(v)| < 1 (6.39) 


(6.38a) 

(6.38b) 


which is the sufficient condition for the existence of a stable fixed point in the 
neighborhood where condition (6.39) holds. We have derived the condition for 
stable fixed-point existence in a simple one-dimensional case. 

Let us look at an example of two systems and analyze the existence of 
stable fixed points. Figure 6.15(a) depicts a curve (Serra and Zanarini 1990) that 
serves as an example: 


f(v) = w 



(6.40) 


The curve is called a triangular map and is made for the parameter w = 1/3, 
0 < v < 1. The function has only a single stable fixed point, v* = 0. The several 
successive recursions shown indicate this clearly. Moreover, condition (6.39) is 
also fulfilled for the entire neighborhood 0 ^ v < 1. 

Another triangular map made for a different value of the triangular map 
parameter, w = 2, is shown in Figure 6.15(b). Two unstable fixed points are 
found for this case. They are at v = 0 and v = 2/3. The fixed point instability 
can be noticed from displayed recursive movements of the solution starting at v° 
and v°. The reader can verify that performing recursions 

V t+1 = /(v*) = 2^1—2 I-v*) 

starting either at v° or at v°, the recursive updates are not converging to any of 
the solutions that clearly exist at v = 0 and v = 2/3. 

As we have seen, even a simple one-dimensional recurrent network exhibits 
interesting properties. The network from Figure 6.14 may be looked at as a single 
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Figure 6.15 Illustration for the example one-dimensional recurrent network performance: 
(a) stable fixed point at 0 and (b) unstable fixed points at 0, and 2/3. 


computing node, which illustrates the discussion of the one-dimensional case just 
concluded. The recurrent progression of outputs in a multidimensional recurrent 
network is much more complex than shown. There are, however, fixed-point 
attractors in such memories similar to the ones discussed. 

Figure 6.16 shows a single-neuron recurrent memory that has two stable 
fixed points at A and B, and one unstable stationary point at the origin. Although 
this network does have self-feedback, it operates with a continuous activation 
function, and thus it does not totally qualify as an associative memory by our 
criteria. However, it is useful for understanding fixed-point recursion mechanisms, 
which make stable or unstable neural networks. In limit case of high feedfor- 
ward gain, the network shown in Figure 6.16 but using a hard-limiting activation 
function would stabilize at output +1 or -1. Note that the network from Figure 
6.16(a) is a disguised bistable flip-flop from Figure 6.8(a). It employs a single 
non-inverting neuron rather than two inverting ones in Cascade. Their operating 
principle concept, behavior and final outcome are the same. 


Modified Memory Convergent 
Toward Fixed Points 

Recall from the previous section that in several basic cases of memory 
studies we have been able to show that net = cs^ m \ where c is a positive 
constant and s (m ^ is one of the stored pattems. With respect to the discussion of 
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(a) (b) 

Figure 6.16 One-neuron recurrent network: (a) diagram and (b) fixed points. 


stable fixed points, these cases could be understood as stable recurrent solutions 
for fixed points such that 

lim v* +1 = s (m,) (6.41) 

k— 


Assume that only a single pattem s (m) has been stored in an autoassociative 
memory. The recurrent update of the f th node requires that 


and it reduces to 


£+1 _ V _ c (™') 

i S i S j S j S i 


(6.42a) 


vf +1 = (n ~ l)sj m ' y (6.42b) 

Thus, the vector s (m ‘ can be considered a stable fixed point in the memory. 

When the network has been taught more than a single pattem, the pattems 
stored remain as fixed points, but a new stable State might also appear. In the 
case for which the number of stored pattems exceeds two, it is not even possible 
to guarantee that they, in faet, remain as fixed points (Serra and Zanarini 1990). 
The substantiation of the property is rather tedious, however, and is thus omitted. 

The discussion below provides some additional insight into complex recur- 
rent processes within the discussed class of memory networks. Realizing the 
limitations of the original model for autoassociative memory, the original Heb- 
bian leaming rule with the diagonal nullification used throughout Sections 6.3 
and 6.4 can be modified. This modification results in a memory with improved 
properties. A memory recording method is presented below that makes each of 
the pattems an independent equilibrium point. 
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Observe that a pattem is a fixed point if and only if the following 
inequality holds: 

^ sf' ) > 0, for i = 1, 2, . . . , n (6.43) 

An alternative design procedure for the memory can be devised based on (6.43)* 
To implement the condition that each of the p pattems be a fixed point, nX p 
inequalities such as (6.43) must be fulfilled. Note that (6.43) holds when the 
following condition holds: 

Ws (m) = s (m) ) for m = 1, 2, . . . , p (6.44a) 

Condition (6.44a) can be written compactly as a single matrix expression 

WS = S (6.44b) 

where 

S = [s (1) s® ... s ( P>] 

To accomplish the modified memory design, we are interested in tinding the 
matrix W that fulfills (6.44b). The matrix equation (6.44b) is actually a system of 
np linear equations for solving n 2 unknowns which are entries of the W matrix. 
Note that for memory that is not overloaded, p should always be smaller than 
n. The linear system (6.44b) is thus underdetermined, so there is a solution for 
W that approximates (6.44b). One of the possible solutions to (6.44) has been 
presented by Personnaz, Guy on, and Dreyfus (1986) and it is reviewed below. 
The resulting solution weight matrix is 

W = SS + (6.45) 

where S + is the Moore-Penrose pseudoinverse matrix (see Appendix). For lin- 
early independent vectors s m , the matrix S + can be expressed as 

s + = (s'sr's' 

It can be seen that since the matrix S is of size n X p, the computation of W 
as in (6.45) involves the inversion of a p X p matrix. The method presented is 
called the projection learning rule and it departs from the conventional Hebb’s 
leaming rule for weight matrix entries (Michel and Farrell 1990). Interestingly, 
formula (6.45) simplifies to the Hebbian leaming rule for orthogonal vectors to 
be stored. Indeed, using the orthogonality condition we have 

S'S = I (6.46) 

which yields from (6.45) 

W = SS' (6.47) 

This result for the weight matrix exactly coincides with Hebbian leaming rule 
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(6.14b). In conclusion, the projection leaming rule offers an improvement over 
the Hebbian rule in the sense that each of the stored memories is a stable fixed 
point. Thus, the method guarantees that the stored vectors correspond to stable 
solutions. Note that both leaming mies lead to symmetric matrices W. The dis- 
advantage of the projection leaming rule is that the regions of attraction for each 
pattem become significantly smaller for a larger number of pattems p stored 
within a given size network. For p approaching n/ 2, the decrease of memory 
convergence quality is rather drastic. 


Advantages and Limitations 

Theoretical considerations and examples of memory networks discussed in 
this chapter point out a number of advantages and limitations. As we have seen, 
recuirent associative memories, whether designed by the Hebbian leaming rule 
or by a modified rule, suffer from substantial capacity limitations. Capacity lim- 
itation causes diversified symptoms. It can amount to convergence to spurious 
memories and difficulties with recovery of stored pattems if they are close to 
each other in the Hamming distance sense. Overloaded memory may not be able 
to recover data stored or may recall spurious outputs. Another inherent problem 
is the memory convergence to stored pattem complements. 

In spite of all these deficiencies, the Hopfield network demonstrates the 
power of recurrent neural processing within a parallel architecture. The recur- 
rences through the thresholding layer of processing neurons tend to eliminate 
gradually noise superimposed on the initializing input vector. This coerces the 
incorrect pattem bits toward one of the stored memories. The network’s computa- 
tional ability makes it possible to apply it in speech processing, database retrieval, 
image processing, pattem classification and other helds. The electronic implemen- 
tations of Hopfield’ s associative memories will be discussed in Chapter 9. 


BIDIRECTIONAL 
ASSOCIATIVE MEMORY 

- Bidirectional associative memory is a heteroassociative, content-addressable 
memory consisting of two layers. It uses the forward and backward information 
flow to produce an associative search for stored stimulus-response association 
(Kosko 1987, 1988). Consider that stored in the memory are p vector association 
pairs known as 


{(a< l >.b' I >), (a«,b< 2 >) (»w.b«)} 


(6.48) 
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When the memory neurons are activated, the network evolves to a stable 
State of two-pattem reverberation, each pattem at output of one layer. The stable 
reverberation corresponds to a local energy minimum. The network 's dynamics 
involves two layers of interaction. Because the memory processes information 
in time and involves bidirectional data flow, it differs in principle from a linear 
associator, although both networks are used to store association pairs. It also 
differs from the recurrent autoassociative memory in its update mode. 


Memory Architecture 

The basic diagram of the bidirectional associative memory is shown in Fig- 
ure 6.17(a). Let us assume that an initializing vector b is applied at the input to 
the layer A of neurons. The neurons are assumed to be bipolar binary. The input 
is processed through the linear connection layer and then through the bipolar 
threshold functions as follows: 


a' = T[Wb] (6.49a) 

where T[*] is a nonlinear operator defined in (6.5). This pass consists of matrix 
multiplication and a bipolar thresholding operation so that the fth output is 

a\ = sgn ^ X w ij b j S j » for i = 1, 2, . . . , n (6.49b) 

Assume that the thresholding as in (6.49a) and (6.49b) is synchronous, and the 
vector a ' now feeds the layer B of neurons. It is now processed in layer B 
through similar matrix multiplication and bipolar thresholding but the processing 
now uses the transposed matrix W of the layer B : 

b' - T[WV] (6.49c) 


or for the j’th output we have 

b'j = sgn Wjjafj , for j = 1, 2, . . . , m (6.49d) 

From now on the sequence of retrieval repeats as in (6.49a) or (6.49b) to compute 
a", then as in (6.49c) or (6.49d) to compute b", etc. The process continues until 
further updates of a and b stop. It can be seen that in terms of a recursive update 
mechanism, the retrieval consists of the following steps: 

First Forward Pass: a 1 = T[Wb°] 

First Backward Pass: b 2 = TfW^a 1 ] 

Second Forward Pass: a 3 = T[Wb 2 ] (6.50) 


k / 2'th Backward Pass: b* = r[W'a A 1 ] 


/ 


i 
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Figure 6.17 Bidirectional associative memory: (a) general diagram and (b) simplified diagram. 
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Ideally, this back-and-forth flow of updated data quickly equilibrates usually in 
one of the fixed pairs (a ( *\b (l) ) from (6.48). Let us consider in more detail the 
design of the memory that would achieve this aim. 

Figure 6.17(b) shows the simplified diagram of the bidirectional associative 
memory often encountered in the literature. Layers A and B operate in an altemate 
fashion — first transferring the neurons’ output signals toward the right by using 
matrix W, and then toward the left by using matrix W f , respectively. 

The bidirectional associative memory maps bipolar binary vectors a = 
[a { a 2 . . . a n ]*, a { — ±1, i = 1, 2, ..., n, into vectors b = [b x b 2 . . . b m ]\ 
b t — il, i = 1 , 2, ..., m, or vice versa. The mapping by the memory can 
also be performed for unipolar binary vectors. The input-output transformation 
is highly nonlinear due to the threshold-based State transitions. 

For proper memory operation, the assumption needs to be made that no State 
changes are occurring in neurons of layers A and B at the same time. The data 
between layers must flow in a circular fashion: A — > B — ► A, etc. The convergence 
of memory is proved by showing that either synchronous or asynchronous State 
changes of a layer decrease the energy. The energy value is reduced during a 
single update, however, only under the update rule (5.7). Because the energy of 
the memory is bounded from below, it will gravitate to fixed points. Since the 
stability of this type of memory is not affected by an asynchronous versus syn- 
chronous State update, it seems wise to assume synchronous operation. This will 
result in larger energy changes and, thus, will produce much faster convergence 
than asynchronous updates which are serial by nature and thus slow. 

Figure 6.18 shows the diagram of discrete-time bidirectional associative 
memory. It reveals more functional details of the memory such as summing 
nodes, TLUs, unit delay elements, and it also introduces explicitly the index of 
recursion k. The figure also reveals a close relationship between the memory 
shown and the single-layer autoassociative memory discussed in Section 6.4. If 
the weight matrix is square and symmetric so that W = W r , then both memories 
become identical and autoassociative. 


Association Encoding and Decoding 

The coding of information (6.48) into the bedirectional associative memory 
is done using the customary outer product rule, or by adding p cross-correlation 
matrices. The formula for the weight matrix is 

W = £ a (i) b (,) ' (6.51a) 

1 = 1 

where a w and b w are bipolar binary vectors, which are members of the fth 
pair. As shown before in (6.8), (6.51a) is equivalent to the Hebbian leaming rule 
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(W), layer A (W r ), layer B 

Figure 6.18 Discrete-time bidirectional associative memory expanded diagram. 


yielding the following weight values: 

w tJ = £ (6.51b) 

m= 1 

Suppose one of the stored pattems, a (m \ is presented to the memory. The 
retrieval proceeds as follows from (6.49a) 


b = T £ (b (m) a (m)? ) 


Åm') 


which further reduces to 

b = r 


nb (m ) + I b (m) a (m), a (ra ' 


(6.52a) 


(6.52b) 


The net ft vector inside brackets in Equation (6.52b) contains a signal term 
nb (m) additive with the noise term kj of value 


TI = 


£ b (m) (a (m) 'a (m '> 

m^m' ' 


(6.53) 


Assuming temporarily the orthogonality of stored pattems a (m) , for m = 1, 
2, . . . , p, the noise term reduces to zero. Therefore, immediate stabilization 
and exact association b = b (w } occurs within only a single pass through layer B. 
If the input vector is a distorted version of pattem a (m \ the stabilization at b (w ^ 
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is not imminent, however, and depends on many factors such as the HD between 
the key vector and prototype vectors, as well as on the orthogonality or HD 
between vectors b (/) , for i = 1, 2, . . . , p. 

To gain better insight into the memory performance, let us look at the noise 
term iq as in (6.53) as a function of HD between the stored prototypes a (m \ for 
m = 1, 2, ...» p. Note that two vectors containing ±1 elements are orthogonal 
if and only if they differ in exactly n/2 bits. Therefore, HD (a (w) ,a (m,) ) = n/2, 
for m = 1,2, . . . , p, m m ! , then r\ = 0 and perfect retrieval in a single pass 
is guaranteed. 

If a (w) , for m = 1,2,...,/?, and the input vector a (m) are somewhat similar 
so that HD (a (m) ,a (m,) ) < n/2, for m — 1, 2, . . . , p, m # m f , the scalar products 
in parentheses in Equation (6.53) tend to be positive, and a positive contribution 
to the entries of the noise vector T| is likely to occur. For this to hold, we need to 
assume the statistical independence of vectors b (m) , for m = 1, 2, . . . , p. Pattem 
b( m ') thus tends to be positively amplified in proportion to the similarity between 
prototype pattems a (/w) and a (m ) . If the pattems are dissimilar rather than similar 
and the HD value is above n/2, then the negative contributions in parentheses in 
Equation (6.53) are negatively amplifying the pattem b (/n,) . Thus, a complement 
— b (m ) may result under the conditions described. 


Stability Considerations 

Let us look at the stability of updates within the bidirectional associative 
memory. As the updates in (6.50) continue and the memory comes to its equi- 
librium at the k'th step, we have a* — > b* +1 — ► a k+2 , and a^ +2 = a*. In such 
a case, the memory is said to be bidirectionally stable . This corresponds to the 
energy function reaching one of its minima after which any further decrease of 
its value is impossible. Let us propose the energy function for minimization by 
this system in transition as 

E(a,b) = ~^a'Wb - ^b'W'a (6.54a) 

The reader may easily verify that this expression reduces to 

E(a,b) = -a'Wb (6.54b) 

Let us evaluate the energy changes during a single pattem recall. The sum- 
mary of thresholding bit updates for the outputs of layer A can be obtained from 
(6.49b) as 

(2 >0 

m 

Aa, = < O for ± Wybj = 0 

i=l,2 n 2=1 

l -2 


< 0 


(6.55a) 
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and for the outputs of layer B they result from (6.49d) 

r 2 , 

A ^ = < 0 for I Wijdi = 

[ _ 2 <-l < 

The gradients of energy (6.54b) with respect to a and 
spectively, as 


as 

0 

0 

0 

b 


(6.55b) 

can be computed, re- 


V„£(a,b) = -Wb (6.56a) 

V b E(a, b) = - W'a (6.56b) 


The bitwise update expressions (6.55) translate into the following energy changes 
due to the single bit increments Aa, and A bf. 


A£ a ,(a,b) 



for i = 1,2,...,« 


(6.57a) 


A£fy(a, b) = -| 


S w, y a, 
i=l ; 


A bj, fory=l, 2, 


m 


(6.57b) 


Inspecting the right sides of Equations (6.57) and comparing them with the 
ordinary update rules as in (6.55) lead to the conclusion that AE < 0. As with 
recurrent autoassociative memory, the energy changes are nonpositive. Since E 
is a bounded function from below according to the following inequality: 


n m 


£(a,b) >-H Kyl 

>=U= i 


(6.58) 


then the memory converges to a stable point. The point is a local minimum of the 
energy function, and the memory is said to be bidirectionally stable. Moreover, no 
restrictions exist regarding the choice of matrix W, so any arbitrary real nXm ma- 
trix will result in bidirectionally stable memory. Let us also note that this discus- 
sion did not assume the asynchronous update for energy function minimization. 
In faet, the energy is minimized for either asynchronous or synchronous updates. 


Memory Example and 
Performance Evaluation 


EXAMPLE 6.3 


In this example we demonstrate coding and retrieval of pairs of pattems. 
The memory to be designed needs to store four pairs of associations. Four 
16-pixel bit maps of letter characters need to be associated to 7-bit binary 


6.5 BIDIRECT10NAL ASSOCIATIVE MEMORY 


361 


vectors as shown in Figure 6.19(a). The respective training pairs of vectors 
(a®b w ), / = 1, 2, 3, 4 are: 


. 0 ) = 


[ 111-11 


1 -1 - 11 - 1 - 1-1111 - 1 ]', 
b (1) = [1-1-1-1-11 1]' 
a (2) = [ 1 -1-1111-111-1111-1-11]', 
b <2) = [1-1-1111 -1]' 
a (3) = [1111-1-11-1-11-1-11111]', 
b (3) = [1-111-11-1]' 

a (4) = [-1 -1 -1 -1 -1 -1 -1 1 1 1 1 1 -1 -1-1 1]', 
b< 4) = [-111-11-11]' 

The weight matrix can be obtained from (6.51) as 


( 6 . 59 ) 
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( 6 . 60 ) 


Assume now that the key vector a 1 at the memory input is a distorted 
prototype of a l2) as shown in Figure 6.19(b) so that HD (a (l) , a 1 ) = 4. 

a 1 = [-1 -1-11-11-111-111-1-11 1]' 

The memory response, b 2 , is computed from (6.50): 

b 2 = T[-16 16 0 0 32 -16 0]' 

and the thresholding by the operator T yields 

b 2 = [-1 1 0 0 1 -1 0]' 

Continuing the retrieval process until the bidirectional stability is reached 
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Pair (1) 


Pair (3) 


h m- 

HZHBUC] b - 

Pair (2) Pair (4) 




(a) 


(b) 


Figure 6.19 Example of bidirectional associative memory: (a) associations stored and (b) four 
steps of retrieval of pair 4. 


within this memory, the updates are computed of following value: 

a 3 = T [ -14 -10 -10 -6 -6 2 -2 10 2 6 10 10 -14 -10 -10 2 ]‘ 

a 3 = [-1 -1-1-1-11-111111-1 -1 l] f 

b 4 = T [ —28 28 8 -8 40 -28 8]' 

b 4 = [-1 1 1 -1 1 -1 1]' 

a 5 = T [ —20 -10 -10 -14 -10 -4 -4 

10 4 10 10 10 -20 -10 -10 0]' 
a 5 = [-1 -1 -1 -1-1-1-11111 1 -1 -1 -1 1]' 
b 6 = b 4 
a 7 = a 5 

The corresponding sequence of pattems is shown in Figure 6.19(b). The 
figure shows four first steps of retrieval. Interestingly, a 1 has converged to 
the prototype pair (a (4) ,b (4) ). Let us note, however, that the HD between a 1 
and a (4) is 

HD (a',a (4) ) = 4 

We can also notice that other HD values of interest for our case are 


HD | 


= 12 

HD 

(•',a< 21 ) 

1 = 4 

HD ( 

V.a®) 

= 10 
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We, therefore, can conclude that the memory has converged to one of the 
two closest prototypes in the Hamming distance sense. m 

Kosko (1988) has shown that the upper limit on the number p of pattem 
pairs which can be stored and successfully retrieved is min (n,m). The substan- 
tiation for this estimate is rather heuristic. It can be statistically expected that if 
p > n, the noise term r\ of Equation (6.53) exceeds the signal term. Similarly, 
for processing of a signal and noise mixture by layer B, it is expected that the 
dominance of the signal term over the noise component is maintained for p < m. 
Hence, a rough and heuristic estimate on memory storage capacity is 

p < min (m,n) (6.61a) 

A more conservative heuristic capacity measure (6.61b) is also used in the liter- 
ature 

p = y/mhi Jrn^n) (6.61b) 

Like the autoassociative recurrent memory, the performance of bidirectional 
associative memory depends on the properties of the stored pattems, and its 
performance varies vastly for different prototype pairs. Figure 6.20 shows the 
simulated convergence versus corruption curves for a bidirectional associative 
memory with n = 25 and m = 9. Randomly generated training pattems a (l) had 
an average HD .= 11 between them. The figure shows four curves for different 
storage levels. The results of the experiment indicate that the memory perfor- 
mance obviously decreases with the increase of p , and also with the increase 
in search argument distortion. But as can be seen from the figure for p — 8, 
the memory filled near its capacity routinely fails to identify or recall its true 
memories stored, even with the association argument undistorted. 

In addition, the curves show that a memory of this size and under the 
conditions described is susceptible to noise in general. Indeed, excellent recall 
is shown for corruption levels limited to below 10%. This is relatively lower 
noise immunity than for the autoassociative recurrent memory with good recall 
properties as shown in Figures 6.12(a) and (b) for levels of distortion up to 25%. 
Although it is difficult to make comparisons between the two different memory 
architectures and different network sizes, the input vector noise sensitivity of 
the bidirectional associative memory makes it rather applicable for low-noise 
situations. 


Improved Coding of Memories 

The generic bidirectional associative memory described in this section suf- 
fers from certain limitations of error-free retrieval of stored pairs. Let us see 
whether the memory performance can be improved by possibly reshaping its 
energy function. During the decoding, the energy (6.54b) remains stable or 
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% Corrupdon 

□ p=2 Ap=4 O p=6 X /?= 8 


Convergence vs. Corruption 

Network Type: Bidirectional Associative Memory (BAM) 

NetWork Parameters: Dimension — n X m (25 X 9) 

Threshold — Hard Limited 
@ Hamming Distance HD = 45 
Curve Parameters: Pattem Pairs p 
Features: 20 samples per point. 

Comments: The upper limit on the number of pattems that can be stored in the memory is 
min This network is fairly insensitive to noise for corruption levels kept 

below 10% at half of its upper limit. Serious deterioration is observed at near 
full capacity where the networks fail to converge even to “true” memories. 


Figure 6.20 Convergence versus corruption in a BAM for a different number of stored associa- 
tion pairs p. 


decreases until it reaches the minimum at (a'\ t/), which is of the following 
value: 

£(a,b) = -a^Wt/ (6.62) 

The energy E has a local minimum at (a^,b^) when none of the vectors at 



6.5 BIDIRECTIONAL ASSOCIATIVE MEMORY 


365 


HD = 1 from af or l/ will yield a lower energy value than (6.62). Consequently, 
when the energy (6.54b) for one of the trained p pairs (6.48) does not constitute 
a local minimum, then the association between a pair of stored vectors cannot 
be recalled even if the search argument is a member of a stored pair. This would 
mean in practice that there are unrecoverable stored associations [see Problem 
P6.20(a) for case studies]. The reason for this deficiency of bidirectional memory 
is that the coding formula (6.51) does not ensure that stored pairs correspond to 
local energy minima. 

Indeed, inspection of the gradient of the energy function (6.56a) and (6.56b) 
reveals that at the m’th association pair its gradient \E is equal: 

\E (a (m) , b (m) ) = - £ a ( ' ) b (,), b (m) (6.63a) 

Note that the gradient (6.63a) is computed at the m’th association pair, 
specifically at b (w) . This implies that if the fc’th bit a ^ of the vector a (m) changes 
by A a^\ the corresponding energy change A E ak is equal to 

AE ak (a (m) , b (m) ) = ~ ( £ a (,, b ( ' ) 'b (m) j A af* (6.63b) 

Assumption of a local minimum E (a (m) , b (m) ) now requires the proof that the 
expression in parentheses of (6.63b) and A be of different signs so that 
AE ak (a (w) ,b (m) ) is positive. This condition, however, is not guaranteed, since 
the update value is A = ±2, and the sign of the expression in parenthesis 
does not altemate with the fc’th bit changing sign. Therefore, the retrieval of the 
association a (/M) , b (m) is not guaranteed for the simple outer product encoding 
method as in (6.51). 

To ensure that the local minima of energy (6.62) are £(a (/w) , b (m) ), the mul- 
tiple encoding strategy has been proposed (Wang 1990). For multiple training of 
order q for the (a (w) ,b (m) ) pair, one augments the matrix W from (6.51) by an 
additional term 

AW = tø - l)a (w) b (w) ' (6.64) 

This indicates that the pair is encoded q times rather than one time; thus its 
energy minimum is enhanced. The coding of the m’th pair among the p existing 
pairs is thus enhanced through amplification of local minima E (a (w) ,b (w) ). 

Using multiple coding (6.64) to produce a distinet energy minimum at 
(a (w) ,b (w) ) however, may produce another pair, or pairs, of vectors in the train- 
ing set that becomes nonrecoverable. This situation can be remedied by further 
multiple training. The multiple training method guarantees that eventually all 
training pairs will be recalled. Indeed, the weight matrix obtained using #-tuple 
training of the m’th pair can be written as 

W = qa im) b {m)t + £ a (0 b (0 ' 

i+m 


(6.65a) 
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For large q , the first term in (6.65a) dominates W so 

W = q2L {m) h {m)t (6.65b) 

Then the recall according to (6.50) proceeds for input b (m) as 

a 1 = qT [a (w) (b (m) ^b ( ' n) )] (6.66) 

Since b (m) 'b (,w) = m > 0, a (m) is perfectly recalled in this pass. A similar single- 
step recall would be guaranteed if a (w) were used as the search argument in the 
#-tuple training case of the m’th pair. 

The multiple training method has been tested for randomly generated sizes 
of memory 4 < n, artd m < 256 with a random p value such that 2 < p < 16, 
and with limiting of the memory load to p < ^min (n,m) (Wang 1990). The 
simulation of memory performance resulted in 98.5% of correct recalls for the 
multiple training method versus 90% of correct recalls for the original bipolar 
associative memory recording algorithm. The multiple training method has also 
increased network capacity by 20 to 40%. Capacity notion is understood here as 
an estimate of the number of recallable training pairs p with a recall success of 
90%. The number of training pairs for capacity simulation was selected at four 
to eight times larger than p to seek the canceling effects of the noise term. 

The probability of correct convergence versus the number of stored pairs 
is shown in Figure 6.2 l(a) for n = m — 100. Curves 1 and 2 correspond to 
unipolar binary vectors used for training the bidirectional associative memory 
in the single training and multiple training modes, respectively. Curves 3 and 4 
are for the respective training modes using bipolar binary vectors for recording. 
Although the multiplicity q of the training has not been explicitly stated in the 
literature, it is known that it has been kept above the value p(2/n). 

To guarantee the recall of all training pairs, a method of dummy augmenta- 
tion of training pairs may be used (Wang 1990). The idea of dummy augmentation 
is to deepen the energy minima at (a (/) , b (z) ), for i = 1,2, . . . , p, without adding 
any risk of false decoding, which normally increases as p increases. This can be 
accomplished by generating a noise-free training set obtained through the gen- 
eration of augmented vectors containing dummy vector elements. The noise-free 
training set concept can be understood based on the noise expression (6.53) in 
terms of the Flamming distance between vectors. The pairs (a (z) ,a (;) ) or (b (0 , b (;) ) 
are called noise-free if for i = 1 , 2, j = 1 , 2, . . . , p 

HD (a (0 ,a 0) ) = ^ 

, N m (6-67) 

HD (b w ,b 0) ) = - 

For a noise-free training set used to create the memory weight matrix W, the 
recall of data would be immediate and error free. The training involving the 
dummy augmentation strategy, however, requires augmenting both vectors a (/) 
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Number of training pairs 
□ Kosko +M.T. O Kosko AM.T. 
(a) 


Figure 6.21a Illustration of improved coding strategies for bipolar heteroassociative memory: 
(a) convergence versus number of encoded pairs. (Adapted from Wang (1990). © IEEE; with 
permission.) 


and b ( *\ for i = 1, 2, p, with additional data (Wang 1990). An example of 
a noise-free training set is shown in Figure 6.21(b). Notice the dummy augmen- 
tation data present in the two bottom rows of the augmented training pattems 
a w , i = 1, 2, 3. The bottom rows data do not belong to the original pattem. 
Their addition to the original training vectors makes the augmented vectors or- 
thogonal according to (6.67). Thus, the dummy augmentation training method 
corresponds to artificial orthogonalization of stored vectors that were originally 
nonorthogonal. 

The concepts of multiple training and dummy augmentation of training pairs 
offer significant advantages over the original recording algorithm for discrete-time 
bidirectional associative memory. They improve the percentage of correct recalls 
for cases in which the search arguments are both the members of the training 
set or their distorted counterparts. Dummy augmentation can be guaranteed to 
produce recall of all training pairs if attaching the dummy data to the training 
pairs is possible and allowable. One disadvantage though is that the memory size 
needs to be increased to handle the storage and retrieval of augmented vectors. 
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(b) 

Figure 6.21b Illustration of improved coding strategies for bipolar heteroassociative memory 
(continued): (b) training pairs for dummy augmentation method. (Adapted from Wang (1990). 
© IEEE; with permission.) 


Multidirectional Associative Memory 

Bidirectional associative memory is a two-layer nonlinear recurrent network 
that accomplishes a two-way associative search for stored stimulus-response asso- 
ciations (a w , b (f) ) , for / = 1,2,...,/?. The bidirectional model can be generalized 
to enable multiple associations (a w ,b (l \c (l \...), / = 1, 2, ..., /?. The multiple 
association memory is called multidirectional (Hagiwara 1990) and is shown 
schematically in Figure 6.22(a) for the five-layer case. Layers are interconnected 
with each other by weights that pass information between them. When one or 
more layers are activated, the network quickly evolves to a stable State of multi- 
pattem reverberation. The reverberation which ends in a stable State corresponds 
to a local energy minimum. 

The concept of the multidirectional associative memory will be illustrated 
with the three-layer network example shown in Figure 6.22(b). Let (a ( *\b (/) ,c (/) ), 
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Layer A 



(a) (b) 

Figure 6.22 Multidirectional associative memory: (a) five-tuple association memory architecture 
and (b) information flow for triple association memory. 


for i = 1, 2, . . . , p, be the bipolar vectors of associations to be stored. General- 
ization of formula (6.51a) yields the following weight matrices: 

= £ a (,) b ( ' ) ' 

(=i 

W C5 = £ c (,) b (0 ' (6.68) 

1=1 

W AC = £ a°V°' 

/= 1 

where the first and second subscript of matrices denote the destination and 
source layer, respectively. With the associations encoded as in (6.68) in directions 
B — > A, B — ► C, C — > A, and reverse direction associations obtained through the 
respective weight matrix transposition, the recall proceeds as follows: Each neu- 
ron independently and synchronously updates its output based on its total input 
sum from all other layers: 

a' = r [W AB b + W^c] 

b'=F[W^c + W>] (6.69) 

C' = r [W CB b + w^ c c] 

The neurons’ States change synchronously according to (6.69) until a multidirec- 
tionally stable State is reached. 
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MAM Layers 


BAM Layers 



A B 






(Correct recollection) 



(False recollection) 


Figure 6.23 Synchronous MAM and BAM example. (Adapted from Hagiwara (1990). © IEEE; 
with permission.) 


Figure 6.23 displays snapshots of the synchronous convergence of three- and 
two-layer memories. The bit map of the originally stored letter A has been cor- 
rupted with a probability of 44% to check the recovery. With the initial input as 
shown, the two-layer memory does not converge correctly. The three-directional 
memory using additional input to layer C recalls the character perfectly as a 
result of a multiple association effect. This happens as a result of the joint 
interaction of layers A and B onto layer C. Therefore, additional associations 
enable better noise suppression. In the context of this conclusion, note also that 
the bidirectional associative memory is a special, two-dimensional case of the 
multidirectional network. 


ASSOCIATIVE MEMORY 
OF SPATIO-TEMPORAL 
PATTERNS 

The bidirectional associative memory concept can be used not only for 
storing p spatial pattems in the form of equilibria encoded in the weight matrix; 
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it can also be used for storing sequences of pattems in the form of dynamic State 
transitions. Such pattems are called temporal and they can be represented as an 
ordered set of vectors or functions. We assume all temporal pattems are bipolar 
binary vectors given by the ordered set, or sequence, S containing p vectors: 

S = { s<V 2) ,...,s^} (6.70) 

where column vectors for i = 1, 2 are / 2 -dimensional. The neural 
network is capable of memorizing the sequence S in its dynamic State transitions 
such that the recalled sequence is 

s (/+1) = n Ws (0 ] (6.71) 

where T is the nonlinear operator as in (6.5) and the superscript summation is 
computed modulo p 4- 1. Starting at the initial State of x(0) in the neighborhood 
of s ( *\ the sequence S is recalled as a cycle of State transitions. This model was 
proposed in Amari (1972) and its behavior was mathematically analyzed. The 
memory model discussed in this section can be briefly called temporal associative 
memory. 

To encode a sequence such that s (1) is associated with s (2) , s (2) with s (3) , 
and s (p) with s (1) , encoding can use the cross-correlation matrices s (/+1) s (/)r . 
Since the pair of vectors s (/) and s ( * +1) can be treated as heteroassociative, the 
bidirectional associative memory can be employed to perform the desired asso- 
ciation. The sequence encoding algorithm for temporal associative memory can 
thus be formulated as a sum of p outer products as follows 

W = V s (i+1) s (i) ' + s (1) s (p) ', or (6.72a) 

1=1 

W = £ s ( ' + V> (6.72b) 

i= 1 

where the superscript summation in (6.72b) is modulo p + 1. Note that if the 
unipolar vectors s (,) are to be encoded, they must first be converted to bipolar 
binary vectors to create correlation matrices as in (6.72), as has been the case 
for regular bidirectional memories encoding. 

A diagram of the temporal associative memory is shown in Figure 6.24(a). 
The network is a two-layer bidirectional associative memory modified in such a 
way that both layers A and B are now described by identical weight matrices W. 
We thus have recall formulas 

a = HWb] (6.73a) 

b = T[Wa] (6.73b) 

where it is understood that layers A and B update nonsimultaneously and in 

an altemate circular fashion. To check the proper recall of the stored sequence, 
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(b) 


Figure 6.24 Temporal associative memory: (a) diagram and (b) pattern recall sequences (for- 
ward and backward). 


vector k = 1, 2, . . . , p> is applied to the input of the layer A as in (6.73a). 
We thus have 
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The vector net a in brackets of Equation (6.74) contains a signal term ns^ k+l) 
and the remainder, which is the noise term ttj 

TI = £ s (!+1) (s (0 's w ) (6.75) 

where the superscript summation is modulo p 4- 1. Assuming the orthogonality 
of the vectors within the sequence S, the noise term is exactly zero and the 
thresholding operation on vector /zs ( * +1) results in s ( * +1) being the retrieved vec- 
tor. Therefore, immediate stabilization and exact association of the appropriate 
member vector of the sequence occurs within a single pass within layer A. Sim- 
ilarly, vector s ( * +1) at the input to layer B will result in recall of s ( * +2) . The 
reader may verify this using (6.73b) and (6.72). Thus, input of any member of 
the sequence set S, say results in the desired circular recalls as follows: 

s (* +1 > — ► s (k+2) s (p) — * s (1) — ► This is illustrated in Figure 6.24(b), 

which shows the forward recall sequence. 

The reader may easily notice that reverse order recall can be implemented 
using the transposed weight matrices in both layers A and B. Indeed, transposing 
(6.72b) yields 

W' = i s (0 s ( ' +1)r (6.76) 

1=1 

When the signal term due to the input is the recall of will 

follow. 

Obviously, if the vectors of sequence S are not mutually orthogonal, the 
noise term iq may not vanish, even after thresholding. Still, for vectors stored 
at a distance HD <C n , the thresholding operation in layer A or B should be 
expected to result in recall of the correct sequence. 

This type of memory will undergo the same limitations and capacity bounds 
as the bidirectional associative memory. The storage capacity of the temporal 
associative memory can be estimated using expression (6.61a). Thus, we have 
the maximum length sequence to be bounded according to the condition p < n. 
More generally, the memory can be used to store k sequences of length p l9 p 2 , 
. . . , p k . Together they include: 

P=ipi (6.77) 

i= 1 

pattems. In such cases, the total number of pattems as in (6.77) should be kept 
below the n value. 

The temporal associative memory operates in a synchronous serial fashion 
similar to a single synchronous update step of a bidirectional associative memory. 
The stability of the memory can be proven by generalizing the theory of stability 
of the bidirectional associative memory. The temporal memory energy function 
is defined as 

E = - £ s (l+1) Ws (i) 

1=1 


(6.78) 
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Calculation of the energy increment due to changes of s (k) produces the following 
equation: 


E = 


— s ( * +,) Ws® - s®Ws (i_1) - 


£ s (i+1) Ws (0 

i±k 
i* k-1 


(6.79) 


The gradient of energy with respect to s* becomes 


V S E = -W's ( * +1) - Ws a_l) 

Considering bitwise updates due to increments As® we obtain 


(6.80) 


A A = - 


£ *A +l) 

U=i 


+ 


i=l j 


As® 


(6.81) 


Each of the two sums in parentheses in Equation (6.81) agree in sign with A s\ k) 
under the sgn ( net )) update rule. The second sum corresponds to net i due to the in- 
put s ( * -1) , which retrieves s ( ^ in the forward direction. The first sum corresponds 
to net i due to the input s^ k+l \ which again retrieves s ( ^ in the reverse direction. 
Thus, the energy increments are negative during the temporal sequence retrieval 
s (i) s ( 2 ) — > ... s (/?) . As shown by Kosko (1988), the energy increases 

stepwise, however, at the transition s (p) — > s (1) , and then it continues to decrease 
within the complete sequence of p — 1 retrievals to follow. 


EXAMPLE 6.4 


In this example temporal associative memory is designed for the vector 
sequence consisting of vectors s (1) , s (2) , s (3) arranged in sequence S: 

S = {[1 -1 -1 1 -!]',[! 1 -1 -1 1]',[1 -1 1 ~1 1]'} (6.82) 


The memory weight matrix W is obtained from (6.72) as 

W = s (2) s (l) ' + s (3) s (2) ' + s (1) s (3) ' 


(6.83a) 


Substituting the temporal vectors specified, the weight matrix (6.83) 
becomes 


‘ 3 

-1 

-1 

-1 

r 


-1 

-1 

-1 

3 

-3 


-1 

3 

-1 

-1 

1 

(6.83b) 

-1 

-1 

3 

-1 

1 


1 

1 

-3 

1 

-1 



Now evaluate recall of the encoded sequence S . With input s (1) , the layer 
response is 

r[Ws (1) ] = r[3 7 -5 -5 5]' 


(6.84) 
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which yields the desired vector s (2) after thresholding. It is interesting to note 
from (6.74) and (6.75) that the noise vector for the computed transition 
is 


which reduces to 



*ni = [-2 


2 0 0 0 ]' 


(6.85a) 


(6.85b) 


The magnitude of each component of is smaller than the magnitude of 
the corresponding component of 5s (2) ; thus, recall is correct in this step due 
to the sufficient value of the signal-to-noise ratio. Further updates of the 
memory yield 

r[Ws (2) ] = r[5 -7 5 -3 3]' (6.86) 

and thus recall of s (3) as required, with the noise term 

t|2 = [ 0 -2 0 2 — 2 ]‘ (6.87) 

Finally, s (1) is recalled due to the thresholding operation 

r[Ws <3) ] = r[5 -7 -3 5 -5]' (6.88) 

with the noise vector 

t| 3 = [0 2 2 0 0]' (6.89) 

This terminates the recall of the sequence s (2) , s <3) , s (1) . m 


6.7 i — — 

CONCLUDING REMARKS 

The associative memories studied in this chapter are a class of artificial neu- 
ral networks capable of implementing complex associative mapping in a Space 
of vector information. The domain of such mapping is a set of memory vectors, 
usually having binary values. Recording, or leaming, algorithms are used to 
encode the required mapping between the memory input and output. In addition 
to coded memories, however, stable false attractors may often result as a side 
effect of the coding algorithm. 

Interesting observations can be drawn from the comparison of the artificial 
neural memory architectures just studied to the capacity of the human brain. 
Biological studies have estimated the number of neurons in a human cortex to 
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be of order of 10 11 (Amit 1989). According to our conclusions in Section 6.4, a 
fully connected cortex with 10 11 synapses could store alO 11 pattems, where a is 
the proportionality constant having a lower bound on its estimated value of 0.14. 
Each of the pattems can be understood at a first glance as a 10 u -tuple pattem 
not correlated with any other pattem in the cortex. This results in a number of 
alO 22 bits of storage space apparently available in the brain. As tentative as this 
rough estimate is, just its order of magnitude indicates an information pressure 
to which any brain model relates. 

But since the nerve network of the cortex is not fully connected, the above 
assumption does not hold. According to neurobiological research estimates, each 
neuron has only about 10 4 synapses. The cortex can thus be viewed as a system 
of 10 7 elementary networks each comprising 10 4 fully connected neurons. Such 
a system can store and retrieve al O 4 pattems of 10 4 bits each. Thus, the total 
informational capacity is alO 7 X 10 4 X 10 4 , which is equal to alO 15 bits. A 
similar order of magnitude capacity results if we consider a network of 10 u 
neurons to be a uniformly connected network with a mean of 10 4 synapses per 
neuron (Amit 1989). 

The estimate of the number of information bits that can penetrate the central 
nervous system in a human lifetime through the sensory organs is 10 20 bits. This 
estimate, from von Neumann (1958), is based on the neurobiologically determined 
mean rate of neural activity (14 /s), the number of input channels equal to the 
estimated number of binary neurons in the brain (10 10 ), and the mean length of 
human life (10 9 s). 

There is still a five order of magnitude difference between the number of 
information bits penetrating the natural human memory (10 20 bits) and its inher- 
ent capacity (alO 15 bits). The explanation of this discrepancy can be examined 
by considering the correlation between the ensuing information clusters. Once 
the network organizes, there are no reasons for independent storage of related 
pattems. Correlated pattems are storable within their basins of attractions. There- 
fore, any significant correlations between leamed pattems reduce the number of 
neurons required for storage. 

The following are the most important quality criteria for design of neural 
associative memories with an auto- or heteroassociative processing capability 
(Hassoun and Youssef 1989): 

1. a large capacity 

2. an immunity to noisy or partial input 

3. few false States which are pseudomemories 

4. no memory limit cycles which are oscillatory States 

5. implementation of the HD criterion or some other similar measure of 

neighborhood. 

A number of recording algorithms and memory architectures have been covered 
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in this chapter. More approaches have been proposed in the literature. As a result, 
improved recording techniques have been proposed to optimize the association 
process. The reader is referred to the specialized literature for further reading 
on the topic. A high-performance recording technique based on the Ho-Kashyap 
algorithm is given by Hassoun and Youssef (1989) and Hassoun (1989). Another 
approach involves a hidden layer of neurons, which increases the orthogonality of 
the stored pattems so that, similar to a Hopfield memory, the stored pattems are 
easier to retrieve (Hoffman and Davenport 1990). Higher order leaming rules are 
also introduced by involving higher order correlations between the stored pattems 
(Chen et al. 1986). Despite applications of various configurations and recording 
methods, however, none of the associative memories has yet been designed to 
be perfect and fault-free. 

Due to the parallel processing mode, associative memories can be success- 
fully applied if correctly designed. An important design aspect should involve 
accounting for a safe memory operation margin. This should include the study 
of existing possible trade-offs between the memory size and the numbers of 
stored pattems or their pairs. The study should also involve the expected noise 
tolerance, distances between stored pattems, etc. 

Neural associative memories are potentially useful in a wide range of appli- 
cations such as content-addressable storage, search-and-retrieve tasks, optimiza- 
tion computing, image restoration, pattem recognition, classification, and code 
correcting (Grant and Sage 1986). Other applications include storage of words 
and of continuous speech. In optical signal processing, associative memories are 
used for code filtering, code mapping, code joining, code shifting, and projecting. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P6.1 The linear associator has to associate the following pairs of vectors: 



1 

2 

_5 

6 

1 

6 


1 

2 


1 

6 


1 

6 



f (1) = [0 1 
-f (2 > = [1 
-+f< 3 > = [o 


0 ]' 

0 1 ]' 
0 0 ]' 


(a) Verify that vectors s (1 \ s (2) , and s (3) are orthonormal. 

(b) Create partial weight matrices for each desired association. 

(c) Compute the total weight matrix. 
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Figure P6.2 Linear associator network. 


(d) Verify the association performed by the network for one of the spec- 
ified pattems 

(e) Distort s (l) by setting one of its components to zero and compute the 
resulting noise vector. 

P62 The linear associator shown in Figure P6.2 has been trained to associate 
three unipolar binary output vectors to the following input vectors: 


*(D = 


S< 2 > = 


s (3) = 


1 

t 

i 

T 

1 

T 


V3 

O 


V3 

Ti 


V6 V6 


Find the weight matrix W. Then compute vectors f <2 \ and f <2) , which 
have been encoded in the network. 

P6.3 Assume that a linear associator has been designed using the cross- 
correlation matrix for heteroassociative association of p orthonormal pat- 
tems. Subsequently, another orthonormal pattem s (p+1) associated with 
f( P +i) mus t be stored. An incremental change in the weight matrix needs 
to be performed using the cross-correlation concept. Prove that the asso- 
ciation s ( ^ +1) -4 f results in no noise term present at the output. 

P6.4 The linear autoassociator needs to associate distorted versions of vectors 
s 1? s 2 , and s 3 with their undistorted prototypes specified in Problem P6.1. 
Calculate the weight matrix for the linear autoassociator network. 

P6.5 A two-neuron recurrent autoassociative memory has one stable State. Its 
energy function is of the form E = — ^v'Wv — i'v, so it includes the 
bias term. The energy function for the Circuit is shown in Figure P6.5. 
Compute elements of W and i and draw the network diagram. 
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Figure P6.5 Energy surface for memory network in Problem P6.5. 

P6.6 The memory recording algorithm expressing the Hebbian leaming rule 
with subsequent nullification of the diagonal matrix 

W = £ s (m) s (m) ' - pi 

m= 1 

can also be used for unipolar binary vector storage. Prove that this mode 
of coding is invariant under the complement operation s (m) = i.e., it 

leads to identical weight matrices in both cases. 

P6.7 A three-bit autoassociative recurrent memory updating under asynchronous 
rule stores a single vector = [ 1 1 -1 ]*. Prepare a State transition 

map using a three-dimensional cube with the energy values assigned to 
each vertex, and with the direction of all possible transitions assigned to 
each edge of the cube. Then evaluate all transitions using each vertex as 
an initial condition. 

P6.8 A blank State transition map for a recurrent autoassociative memory 
with asynchronous update is shown in Figure P6.8. Assume that s (1) = 
[ 1 1 — 1 1 ] 1 and s< 2 > = [ 1 — 1 1 — 1 ] ; need to be stored. 

(a) Compute E(\). 

(b) Mark energy values at each node of the map. 

(c) Label all possible transitions on the State diagram. 
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1 — 1 — 1—1 — 1 — 1 — 1—1 



Figure P6.8 Blank State transition map for Problem P6.8. 


(d) Identify whether or not any initial conditions occur that lead to final 
States that are either spurious memories or that are true memories but 
are nonoptimal in the Hamming distance sense. 

P6.9 A four-neuron Hopfield autoassociative memory has been designed for 
s (i) - —\ —i l] . Assume that the recall is performed syn- 

chronously, i.e. all four memory outputs update at the same time. 

(a) Draw the State transition map for this recall mode assuming that any 
of the 16 States can be considered to be the initial one. 

(b) Compute E(\). 

(c) Determine if the energy value decreases in this update mode. 

P6.10 The folio wing vectors need to be stored in a reeurrent autoassociative 
memory: 

s (1) =[l 1111]' 

s (2) =[l -1 -1 1 -1]' 

s (3) =[-l 1 1 1 1]' 


(a) Compute the weight matrix W. 
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(b) Apply the input vector v° = [ 1 -1 -1 1 l] ? and allow for 

asynchronous convergence in ascending node order starting at node 1. 

(c) With the same input vector as in part (b) allow for asynchronous 
convergence in descending node order starting at node 5. 

(d) Comment as to whether Hopfield memory provides an optimal solu- 
tion in terms of the Hamming distance criterion. [The results may be 
surprising since this is an overloaded memory. Assume sgn (0) = +1 
for update purposes in this problem.] 

P6.ll The following unipolar binary vectors must be stored in the recurrent au- 
toassociative memory using the outer product method with the nullification 
of the diagonal: 

s (1) = [l 0 0 10]' 

s (2) = [0 110 1]' 

s (3) =[l 10 10]' 

(a) Compute matrix W. 

(b) Find the analy tical expression for the energy function that the memory 
is minimizing. 

P6.12 Assume that the recurrent autoassociative memory has been designed for 
p < n orthogonal pattems encoded in its weight matrix. Show that present- 
ing the key pattem s (/7+1) , which is orthogonal to the encoded pattems 
s (1) , s (2) , . . . , s (p) , but has not been stored, results in the initial energy 
value E(s ( p+ 1) ) = + Then compute the energy minima values for 
this memory for each of the stored vectors s (1) , s (2 \ . . . , s (p) . 

P6.13 Compute the energy values for all 16 bipolar binary key vectors for a five- 
bit autoassociative recurrent memory having the following weight matrix 


(complement vectors need not 

be checked): 
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By comparing the energy levels, prepare a hypothesis regarding the two 
stored vectors. Then verify your hypothesis by comparing the computed 
weight matrix with the matrix W specified above. 

P6.14 Vectors a (1) = [ 1 - 1 1 1 ] 1 and a (2) = [ — 1 1 - 1 1 ] 1 represent 

the bit maps shown in Figure P6.14. They need to be autoassociated for 
input vectors at HD = 1 from the prototypes stored. Design the following 
two autoassociators: 
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a (1) =[l — 11 l] r 

Figure P6.14 Bit maps for Problem P6.14. 



a (2) =[ — 11—11]' 


(a) a linear autoassociator in the form of a linear associative memory 
with added thresholding element (TLU) at the output, if needed, but 
no feedback 

(b) a recurrent autoassociative memory with asynchronous updating. 

Compare the performance of both networks by evaluating responses to 
eight input vectors at HD = 1 from a (1) and a (2) . The nodes of the mem- 
ory designed in part (b) should be updated in ascending order starting at 
node 1. 

P6.15 A recurrent autoassociative memory has been designed to store s (1) = 

[ 1 - 1 1 1 1 ] X and s (2) = [ — 1 1 1 — 1 1 ] *. The energy val- 

ues have been computed and marked on the map of State transitions for 
this memory as shown in Figure P6.15. By analyzing the memory State 
diagram for all transitions at HD = 1, determine if this memory has any 
false spurious terminal States. If yes, what are they and what are the input 
vectors that could possibly be associating with the false memories? 

P6.16* Design the autoassociative recurrent memory to store the bit maps of the 
characters A , /, and O as specified in Figure P6.16. List the resulting 
W matrix. Implement the asynchronous retrieval algorithm starting at the 
upper leftmost pixel of the map, moving left to right during the update. 
Simulate the asynchronous convergence of the memory toward each of 
the characters stored in the following cases of key vectors: 

(a) left upper pixel of each character reversed 

(b) four center column pixels of the character field reversed. 

Provide the retrieved vectors in each of six cases and note the number of 
recursion cycles. (Each recursion cycle consists here of a sequence of 16 
updates.) 



PROBLEMS 


383 


i 



(Energy values encircled) 


Figure P6.15 Unfinished State transition diagram for Problem P6.15. 

P6.17 Analyze graphically the recurrent network with one neuron as shown 
in Figure 6.15 for the two cases of neuron activation functions as shown in 
Figure P6.17a,b. Determine whether points A , B , C, and D are stable or 
unstable fixed points, or cyclic solutions. 

P6.18 The table here lists three scalar energy functions E(x) that are often 
encountered in the studies of associative memory. Their respective expres- 
sions for gradient vectors V£(x) are also listed. Derive the expressions for 
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gradient vectors for the energy functions listed in the left column. Assume 
W is (nX n) and i and x are ( n X 1). 


Case 

E(x) 

V£(x) 

i 

i'x 

i 

2 

x'Wi 

Wi 

3 

x'Wx + i'x 

(W + W)x + i 


P6.19* Design the bidirectional associative memory that associates characters 
(A,C), (/,/), and (0,7) designed using the bit maps of A, /, and O from 
Problem P4.16 and of C, /, and T from Figure 4.19. 

(a) Compute W. 

(b) Check associations within stored pairs using search arguments as the 
stored characters. 

P6.20 * Given are the three following training pairs (a (/) ,b (0 ), for i = 1, 2, 3, for 
bidirectional associative memory: 


ii 

[{i -l-iiii-i -i -i]'. 
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(a) (b) 


Figure P6.17 Activation functions for neuron in Problem P6.17. 


Find the weight matrix W for the associative memory. Then: 

(a) Determine that the input a (2) does not result in retrieval of b (2) ; find 

retrieved instead of b (2) . 

(b) Find E( a^\b^) and £(a (2) ,b (2) ) and compare their values. 

(c) Determine whether or not £(a (2) ,b (2) ) is a local minimum. This can 
be done by evaluating £(a (2 \b (2) ) and comparing it with the energy 
values at points that are at HD = 1 from a (2) and then from b (2) . 

P6.21 Prove the following theorem: If a training pair (a (/w) ,b ( ^) is a local 
minimum of the energy function as in (6.54b), then the recovery is 
guaranteed since 

F[Wb (m) ] = a m and r[Wa ( "°] = b {m) 

[Hint: Examine the condition for local minimum 
-a (m) Wb (m)r < — a'^Wb^ 

where a' (w) is defined as differing by HD = 1 from a (m) . Repeat for b' (m) 
differing by HD = 1 from b (w) .) 

P6.22* Some of the three training pairs in Problem P6.20 coded according to 
(6.51) are not recoverable when training of multiplicity 1 is used. How- 
ever, a multiple training method can be used to guarantee the recovery. 
Apply the multiple training method to obtain the weight matrix with guar- 
anteed recovery for each member of the training pair. Try multiplicity of 
2 or 3 in order to test the encoding of each pair. Devise the lowest 
multiplicity of training for each pair that would guarantee recovery. 
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P6.23 A temporal associative memory needs to be designed for recall of the 
following sequence: 
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-1 
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-1 

, s< 2 > = 

i 

, s® = 
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-i 
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-1 


-i 


-i 


Compute the weight matrix W and check the recall of pattems in the 
forward and backward directions. Assume a bipolar binary activation func- 
tion. 


P6.24 The weight matrix of the temporal associative memory is known as 
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Knowing that a vector s (1) = [-1 1 —1 —1 l] r belongs to a se- 

quence, find the remaining vectors of the sequence. Having found the 
full sequence, verify that encoding it actually yields the weight matrix W 
as specified in the problem. Calculate the noise term vectors generated 
at each recall step and determine that they are suppressed during the 
thresholding operation. 
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7.1 Hamming Net and MAXNET 

7.2 Unsupervised Learning of Clusters 

7.3 Counterpropagation NetWork 

7.4 Feature Mapping 

7.5 Self-Organizing Feature Maps 

7.6 Cluster Discovery NetWork (ÅRTI) 

7.7 Concluding Remarks 


T he neural networks covered in previous chapters represent a number of 
distinet network classes. We studied singie-layer and multilayer percep- 
tron networks, which are essentially feedforward recall architectures trained with 
supervision. Continuous- and discrete-time singie-layer networks with recurrent 
recall form another separate architectural group of networks. Associative memo- 
ries can also be treated as a distinet class of neural networks. Despite somewhat 
diversified architectures, associative memories exhibit a common learning mode 
based on auto- and cross-correlation between stored vectors. Although other 
learning techniques exist, memories are typically designed with fixed weights 
by using the technique of batch learning, or recording. 

The networks we discuss in this chapter represent a combination of the ar- 
chitectures and recall modes studied thus far. One novel aspect is the networks’ 
learning mode. Our purpose in this chapter is to present networks that leam 
neither by the familiar correlation rule, nor by the perceptron or gradient descent 
techniques we have studied. In faet, no feedback from the environment will be 
provided during learning. The network must discover for itself any relationships 
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of interest that may exist in the input data. Our interest is in designing networks 
that are able to translate the discovered relationships into outputs. We will see 
in this chapter that discovery of pattems, features, regularities, or categories can 
be leamed without a teacher. 

Networks trained without a teacher usually leam by matching certain ex- 
plicit familiarity criteria. These networks can produce output that tells us how 
familiar it is with the present pattem. This is done by comparing the present 
pattem with typical pattems seen in the past. Consider that the present pattem 
is evaluated for “similarity” with typical pattems from the past. One important 
measure of similarity used for leaming studied in this chapter is the maximum 
value of the scalar product of the weights and input vector. Using the scalar 
product metric, weights can be trained in an unsupervised mode to resemble 
frequent input pattems, or better, pattem clusters. Another often used measure of 
similarity is the topological neighborhood, or distance, between the responding 
neurons arranged in regular geometrical array s. 

Our objective in this chapter is to explore the unsupervised training algo- 
rithms of neural networks so that their weights can sensibly adapt during the 
process called self-organization . Networks trained in this mode will not only 
react to values of inputs but also to their statistical parameters. In this chapter 
we will assume that both inputs and their probability density will affect the 
training. Rare inputs will have less impact for leaming than those occurring 
frequently. Sometimes, training inputs will need to be applied in their natural 
sequence of occurrence. 

Unsupervised leaming may seem at first impossible to accomplish. Indeed, 
at the beginning of the leaming process, the network’s responses may be im- 
plausible. The lack of a teacher ’s input forces the network to leam gradually 
by itself which features need to be considered in classification or recognition 
and thus need to be reinforced. At the same time, other less important features 
can be neglected or suppressed based on the mapping criteria that underlie the 
unsupervised leaming algorithms. The proximity measures allow the computation 
of important indicators of how to differentiate between more and less important 
features during the unsupervised leaming. What is very peculiar to this leaming 
mode is that the indicators have to be created based on the history of leaming 
experience. 

Unsupervised leaming can only be implemented with redundant input data. 
Redundancy provides knowledge about statistical properties of input pattems. In 
faet, there are close connections between statistical approaches to pattem classi- 
fication and the neural network techniques discussed here (Duda and Hart 1973). 

In the absence of target responses or guidelines provided direetly by the 
teacher, the network can, for example, build classification /recognition /mapping 
performance during the process devised in Chapter 2 as competitive leaming. 
The winning output node will receive the only reward during leaming in the 
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form of a weight adjustment. In some cases, the extension of rewards for the 
node’s neighborhood will also be allowed. Strengthening the winning weights 
gradually builds into the network a history of what has been seen in the past. 
Such networks are termed networks with history sensitivity. 

The neural network architectures covered in this chapter include the Ham- 
ming network, MAXNET, the clustering Kohonen layer, the Grossberg outstar 
leaming layer, a counterpropagation network, self-organizing feature mapping 
networks, and an adaptive resonance network for cluster discovery and classifi- 
cation of binary vectors. 


7.1 __________ 

HAMMING NET AND MAXNET 

In this section a two-layer classifier of binary bipolar vectors will be cov- 
ered. The block diagram of the network is shown in Figure 7.1. It is a minimum 
Hamming distance classifier, which selects the stored classes that are at a min- 
imum HD value to the noisy or incomplete argument vector presented at the 
input. This selection is essentially performed solely by the Hamming network. 
The Hamming network is of the feedforward type and constitutes the first layer of 
the classifier. The p-class Hamming network has p output neurons. The strongest 
response of a neuron is indicative of the minimum HD value between the input 
and the category this neuron represents. The second layer of the classifier is 
called MAXNET and it operates as a recurrent recall network in an auxiliary 
mode. Its only function is to suppress values at MAXNET output nodes other 
than the initially maximum output node of the first layer (Lippmann 1987). 

As stated, the proper part of the classifier is the Hamming network respon- 
sible for matching of the input vector with stored vectors. The expanded diagram 
of the Hamming network for classification of bipolar binary n - tuple input vectors 
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Figure 7.1 Block diagram of the minimum HD classifier. 
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Figure 7.2 Hamming network for n-bit bipolar binary vectors representing p classes: (a) classi- 
fier network and (b) neurons’ activation function. 


is shown in Figure 7.2(a). The purpose of the layer is to compute, in a feed- 
forward manner, the values of (« — HD), where HD is the Hamming distance 
between the search argument and the encoded class prototype vector. Assume 

that the n-tuple prototype vector of the m’th class is s (m) , for m = 1,2 p, 

and the n-tuple input vector is x. Note that the entries of the weight vector w m 
defined as 

= [w„i W m2 ... W mn ]' , for m = 1, 2, ...,p 

connect inputs to the m’th neuron, which performs as the class indicator. 

Before we express the value of the Hamming distance by means of the 
scalar product of x and s (w \ let us formalize the metric introduced here. A 
vector classifier with p outputs, one for each class, can be conceived such that 
the m’th output is 1 if and only if x = s (m) . This would require that the weights 
be w m = s (m) . The classifier outputs are x's (1) , x's (2) , . . . , x's (m) , . . . , x's (p) . When 
x = s {m \ only the m’th output is n, provided the classes differ from each other, 
and assuming ±1 entries of x. The scalar product of vectors has been used here 
as an obvious measure for vector matching. 

The scalar product x r s (m) of two bipolar binary n-tuple vectors can be written 
as the total number of positions in which the two vectors agree minus the number 
of positions in which they differ. Note that the number of different bit positions 
is the HD value. Understandably, the number of positions in which two vectors 
agree is n — HD. The equality is written 

x's (m) = (n - HD(x,s w )) - HD(x,s (m) ) 


(7.1a) 
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This is equivalent to 

ix's (m) = | - HD(x,s <m} ) (7.1b) 

We can now see that the weight matrix W H of the Hamming network can be 
created by encoding the class vector prototypes as rows in the form as below 


W„ = 


.0) «(l) 

s l S 2 

JZ) ( 2 ) 
s l s 2 


cd)' 

c(2) 


(7.2) 


åp) Jp) 
S 1 s 2 


o (P) 


where the Vi factor is convenient for scaling purposes. Now the network with 
input vector x yields the value of (V6)xV m) at the input to the node m, for m = 1, 
2, Adding the fixed bias value of w/2 to the input of each neuron results 

in the total input net m 


net m = 7 xV m) + for m = 1, 2, . . . , p 


2 2 

Using the identity (7.1b), net m can be expressed as 
net m = n- HD(x,s (m) ) 


(7.3a) 


(7.3b) 


Let us apply neurons in the Hamming network with activation functions as in 
Figure 7.2(b). The neurons need to perform only the linear scaling of (7.3b) 
such thaif(net m ) = ( \/n)net m , for m - 1, 2, p. Since inputs are between 
O and n, we obtain the outputs of each node scaled down to between O and 1. 
Furthermore, the number of the node with the highest output indeed indicates 
the class number to which x is at the smallest HD. A perfect match of input 
vector to class m, which is equivalent to the condition HD = O, is signaled by 
f(net m ) = 1. An input vector that is the complement of the prototype of class m 
would result in f(net m ) = 0. The response of the Hamming network essentially 
terminates the classification in which only the first layer of network from Figure 
7.1 computes the relevant matching score values. As seen, the classification by 
the Hamming network is performed in a feedforward and instantaneous manner. 

MAXNET needs to be employed as a second layer only for cases in which 
an enhancement of the initial dominant response of the m’th node is required. As 
a result of MAXNET recurrent processing, the m’th node responds positively, 
as opposed to all remaining nodes whose responses should have decayed to 
zero. As shown in Figure 7.3(a), MAXNET is a recurrent network involving 
both excitatory and inhibitory connections. The excitatory connection within the 
network is implemented in the form of a single positive self-feedback loop with 
a weighting coefficient of 1. All the remaining connections of this fully coupled 
feedback network are inhibitory. They are represented as M — 1 cross-feedback 
synapses with coefficients —s from each output. The second layer weight matrix 
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Figure 7.3 MAXNET for p classes; (a) network architecture and (b) neuron’s activation function. 


W iM of size p X p is thus of the form (Pao 1989) 



— s 
—s 


1 


(7.4) 


where s must be bounded 0 < s < 1 lp. The quantity s can be called the lateral 
interaction coefficient . With the activation function as shown in Figure 7.3(b) 
and the initializing inputs fulfilling conditions 

0 < 1, for 

the MAXNET network gradually suppresses all but the largest initial network 
excitation. When initialized with the input vector y°, the network starts processing 
it by adding positive self~feedback and negative cross-feedback. As a result of 
a number of reeurrences, the only unsuppressed node will be the one with the 
largest initializing entry y° . This means that the only nonzero output response 
node is the node closest to the input vector argument in HD sense. The recurrent 
processing by MAXNET leading to this response is 

y k+i = r[W M y*] 


(7.5a) 
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where T is a nonlinear diagonal matrix operator with entries /(•) given below: 


f(ne,) - { i 


net < 0 
net > 0 


(7.5b) 


Each entry of the updated vector y* +1 decreases at the fc’th recursion step of 
(7.5a) under the MAXNET update algorithm, with the largest entry decreasing 
slowest. This is due to the conditions on the entries of matrix W M as in (7.4), 
specifically, due to the condition 0 < e < i/p. 

Assume that y° m > i = 1,2,...,/? and i + m. During the first recuirence, 
all entries of y 1 are computed on the linear portion of f(net). The smallest of 
all y° entries will first reach the level f{net) = 0, assumed at the £’th step. The 
clipping of one output entry slows down the decrease of y* +1 in all forthcoming 
steps. Then, the second smallest entry of y° reaches f(net) = 0. The process 
repeats itself until all values except for one, at the output of the m’th node, 
remain at nonzero values. The example below explains the design and operation 
of the discussed classifier. 


EXAMPLE 7.1 


Let us design the minimum Hamming distance classifier for three characters 
C, /, and T as shown in Figure 4.19 for which the bipolar binary class 
prototype vectors are 

s (1) = [1 1 1 1 -1 -11 1 1]' 

s (2) =[-l l-l-ll -1 -1 1 -1]' (7.6a) 

s (3) =[l 11-11-1-11 -l] f 


The weight matrix W w of the Hamming network obtained from (7.2) is 


W„ = 


1 

-1 

1 


1 1 1-1-1 1 
1 - 1-1 1 -1 -1 
1 1-1 1-1-1 


i r 

i -i 
i -i 


(7.6b) 


Note that the matrix of Equation (7.6b) determines the bipolar binary values 
of weights Wy. Using formula (7.3a), the input vector net is expressed as 
follows: 



1 

-1 

1 


m 

111-1-1111 
1 - 1-1 1 - 1-1 1-1 
1 1-1 1 - 1-1 1-1 

L 2 J 



(7.6c) 


Selecting s = 0.2 < l A terminales the design process of both layers. The 
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— — Weights of — 1 


Figure 7.4 Hamming network and MAXNET for Example 7.1. 


designed network is shown in Figure 7.4. Let us look at sample responses 
of both layers of the network. Assume the test input vector to be 

x = [1 1 1 1 1 1 1 1 l]' 

For the Hamming layer neurons’ activation and responses, respeetively, we 
obtain from (7.3a) with this input 



Since the computed output, T[net], is the input vector to MAXNET, we 
thus have T[net] = y°. The MAXNET recurrent formula (7.5a) yields the 
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following activation: 


net* = 


net 

i 

netj 

neti 




“ 

k 




• * 


y 2 


* 


_y 3 _ 


T k+\ _ 


and the following response computed from (7.5b) 

/(net^) 

If(net^) J 

In the recurrences k = 0, 1, 2, 3, the network produces the following 
activations and responses, respectively: 


Step 1 : k = 0: 


net 


1 

- 0.2 

- 0.2 


- 0.2 - 0.2 

1 - 0.2 

- 0.2 1 



'0.777' 


'0.599' 


0.333 


0.067 


,0.555 


0.333 


Step 2: k = 1: 


Step 3: k = 2: 


Step 4: k = 3: 


y 


i 


0.599 

0.067 

0.333 


net 1 = 

' 0.520' 
-0.120 

. y 2 = 

'0.520' 

0 


0.120 


.0.120 


net 2 = 

' 0.480' 
-0.14 
0.096. 

. y 3 = 

'0.480' 

0 

.0.096. 

net 3 = 

' 0.461' 
-0.115 

-10“ 7 

. y 4 = 

'0.461' 

0 

0 
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This terminates the recursion, since for further recurrences all values of 

y*+! = y 4 , f Qr £ 

> 3. Notice that the thresholding began in Step 2 due 
to the conditions at the second neuron of the MAXNET. The activation 
value net\ — —0.12 has resulted in the output y\ = 0 in this step. This 
has obviously slowed down the rate of decrease of other nonzero outputs 
and, more importantly, of y l . The result computed by the network after 
four recurrences indicates that the vector x presented at the input for the 
minimum Hamming distance classification has been at the smallest HD from 
s (1) and, therefore, it represents the distorted character C. wm 

Let us summariz'e the benefits of the discussed neural network architecture. 
As stated, the Hamming net and MAXNET jointly implement the optimum min- 
imum bit error classification. Let us compare the size of this classifier with its 
Hopfield autoassociator network counterpart. Assume that a classifier needs to 
classify a 50-tuple vector belonging to one of the five classes. The Hamming 
network requires a total of 255 connections to 5 neurons. For comparison, a Hop- 
field autoassociative memory would involve 2450 connections to 50 neurons. In 
addition, Hopfield memory is sensitive to the HD parameter within the set of 
stored pattems. As a result of a random search for the minimum of the energy, 
Hopfield network may produce spurious responses. Also, the network has rather 
stringent capacity limits. 

This comparison indicates that the Hamming network looks somewhat ad- 
vantageous as a minimum HD classifier. Moreover, the network is rather straight- 
forward to build and, excluding its MAXNET stage, is of the simple feedforward 
type. However, the Hamming network retrieves only the closest class index and 
not the entire prototype vector. The Hamming network is therefore not able to 
restore any of the key pattem vector entries, if input is a coirupted version of 
the stored prototype, but can only provide the closest similarity measure of the 
input to the set of prototypes. We should thus realize that the Hamming network 
is only a pure classifier and not an autoassociative memory. It provides passive 
classification and has no built-in data restoration mechanisms. Such mechanisms 
are naturally embedded, however, in recurrent autoassociative architectures. 

The reader has undoubtedly noticed that Hamming network and MAXNET 
weights are designed by recording rather than through incremental leaming, 
whether supervised or unsupervised. We decided to study Hamming networks 
in the chapter dealing with unsupervised leaming networks, however, because 
Hamming distance classifiers use the same similarity measure as many of the 
neural networks that leam in an unsupervised environment. Namely, the network 
detects the match between the stored prototype and the key vector by computing 
and comparing their scalar product. This is accomplished by encoding the pro- 
totype vector into respective weights of the neurons being the class indicator for 
this specific prototype. 
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7.2 

UNSUPERVISED LEARNING 
OF CLUSTERS 

Our discussion thus far has focused on the classification, recognition, and 
association tasks performed by neural networks that have been previously trained. 
The training has been either supervised, or network parameters have been ex- 
plicitly computed based on the design requirements in lieu of training. We have 
consistently assumed the existence of a training set containing both inputs and 
required responses. In this section we present unsupervised classification leam- 
ing. The leaming is based on clustering of input data. No a priori knowledge is 
assumed to be available regarding an input’s membership in a particular class. 
Rather, gradually detected characteristics and a history of training will be used 
to assist the network in defining classes and possible boundaries between them. 

Such unsupervised classification, called clustering, can possibly be thought 
of as primary compared to a classification of membership in classes that have al- 
ready been formed. In faet, evolutionary leaming by humans must have originated 
thousands of years ago in an unsupervised form, since there were no teachers, 
instructions, templates, or books then. Over the years, all known living things, 
for example, have first been classified as belonging to eertain clusters according 
to their observable characteristics. The single and sensible criterion used has 
been the objects’ similaries within a cluster. As a result of this grouping, clusters 
were labeled with appropriate names that best reflect their characteristics. Later 
on, objects were given names following more detailed classification criteria that 
were developed within clusters. 

The objective of clustering neural networks discussed in this section is to 
categorize or cluster data. The classes must first be found from the correlations 
of an input data stream. Since the network actually deals with unlabeled data, 
the clustering should be followed by labeling clusters with appropriate category 
names or numbers. This process of providing the category of objects with a label 
is usually termed as calibration. 


Clustering and Similarity Measures 

Clustering is understood to be the grouping of similar objects and separating 
of dissimilar ones. Suppose we are given a set of pattems without any informa- 
tion as to the number of classes that may be present in the set. The clustering 
problem in such a case is that of identifying the number of classes according 
to a eertain criterion, and of assigning the membership of the pattems in these 
classes. The clustering technique presented below assumes that the number of 
classes is known a priori. The pattem set {x l9 x 2 , ... ,x^} is submitted to the 
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Cluster 1 
^ 's 



(a) (b) 

Figure 7.5 Measures of similarity for clustering data: (a) distance and (b) a normalized scalar 
product. 


input to determine decision functions required to identify possible clusters. Since 
no information is available from the teacher as far as the desired classifier’s 
responses, we will use the similarity of incoming pattems as the criterion for 
clustering. 

To define a cluster, we need to establish a basis for assigning pattems to the 
domain of a particular cluster. The most common similarity rule, already used in 
Chapter 3, is the Euclidean distance between two pattems x, for x, defined as 


/(X - X;)'(X - X;) 


This mle of similarity is simple: the smaller the distance, the doser the pattems. 
Using (7.7), the distances between all pairs of points are computed. A distance 
T can then be chosen to discriminate clusters. The value T is understood as the 
maximum distance between pattems within a single cluster. Figure 7.5(a) shows 
an example of two clusters with a T value chosen to be greater than the typical 
within-cluster distance but smaller than the between-cluster distance. 

Another similarity mle is the cosine of the angle between x and x z -: 


cos ip — 



(7.8) 


This mle is particularly useful when clusters develop along certain principal and 
different axes as shown in Figure 7.5(b). For cos ip 2 < cos pattem x is more 
similar to x 2 than to x x . It would thus be natural to group it with the second of 
the two apparent clusters. To facilitate this decision, the threshold angle ip T can 
be chosen to define the minimum angular cluster distance. It should be noted, 
however, that the measure defined in (7.8) should be used according to certain 
additional qualifications. If the angular similarity criterion (7.8) is to be efficient, 
vectors x l5 x 2 , and x should be of comparable, or better, identical lengths. 
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A number of traditional cluster search algorithms such as maximum-distance, 
£-means, and isodata, are used in pattem recognition techniques (Tou and Gonza- 
lez 1974; Duda and Hart 1973). In our presentation, we will focus on techniques 
that involve weight training and employ connectionist algorithms for budding 
clusters rather than on conventional clustering methods. 


Winner-Take-AII Learning 

The network discussed in this section classifies input vectors into one of 
the specified number of p categories according to the clusters detected in the 
training set {x l9 x 2 , . . »x#}. The training is performed in an unsupervised mode, 
and the network c "undergoes the self-organization process. During the training, 
dissimilar vectors are rejected, and only one, the most similar, is accepted for 
weight building. As mentioned, it is impossible in this training method to assign 
/ (/ network nodes to specific input classes in advance. It is equally impossible to 
C; yci^ predict which neurons will be activated by members of particular clusters at the 
beginning of the training. This node to cluster assignment is, however, easily 
done Hy calibrating the network after training. 

The network to be trained is called the Kokonen network (Kohonen 1988) 
and is shown in Figure 7.6(a). The processing of input data x from the training set 
{x 1 ,x 2 ,... ,x N }, which represents p clusters, follows the customary expression 

f ~ , Xc Q j'‘ r* 

y = r[Wx] 1 ^ (7.9) 

with diagonal elements of the operator T being continuous activation functions 
operating componentwise on entries of vector Wx. The processing by the layer 
of neurons is instaniåneous and feedforward. To analyze network performance, 
we rearrange the matrix W to the folio wing form: 




for i = 1, 2, . . . , p 


is the column vector equal to the j’th row of the weight matrix W. Component 
weights of w m are highlighted in Figure 7.6(a) showing a winner-take-all learning 
network. The learning algorithm treats the set of p weight vectors as variable 


where 




w,- = 


H ’ii 

wn 
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Figure 7.6 Winner-take-all learning rule: (a) learning layer, (b) vector diagram, and (c) weight 
vectors on a unity sphere for p = n = 3. 

y£ 

y 

vectors that need to be leamed. Prior to the learning, the normalization of all 
weight vectors is required: 

A W 

W i = tt-At, for i =1,2,. ... , p 


The weight adjustment criterion for this mode of training is the selection of w. 
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such that 

ll x “ w m || = . min {||x-w,||} (7.10) 

/= 1 , 2 ,... ,p 

The index m denotes the winning neuron number corresponding to the vector 
w w , which is the clqsegt ^pproximation of the current input x. Let us see how 
this leaming should proceed in terms of weight adjustments. The left side of 
Equation (7.10) can be rearranged to the form 

ll x - wj| = (x'x - 2w / m x - 1) 1/2 (7.11) 

It is obvious from (7.11) that searching for the minimum of p distances as on 
the right side of (7.10) corresponds to finding the maximum among the p scalar 
products 

W^X = . ™ ax (w*x) (7.12) 

The left side of Equation (7.12) is the activation value of the “winning” neuron 
which has the largest value net,-, i — 1, 2, . . . , p. When using the scalar product 
metric of similarity as in (7.12), the synaptic weight vectors should be modified 
accordingly so^ that they become more similar to the current input vector. With 
the similarity criterion being cos tf/ as in (7.8), the weight vector lengths should be 
identical for this training approach. However, their directions should be modified. 
y Intuitively, it is clear that a very long weight vector could lead to a large output 
s v of its neuron even if there were a large angle between the weight vector and the 
pattem. This explains the need for weight normalization. 

After the winning neuron has been identified and declared a winner, its 
weights must be adjusted so that the distance (7.10) is reduced in the current 
training step. Thus, ||x — w m || must be reduced, preferably along the gradient 
direction in the weight Space w ml , w m2 , . .., w mn 

lix - wj 2 = -2(x - wj (7.13a) 


o~' 




Since vectors x have a certain probability distribution within a cluster and we are 
dealing^with a single adjustment step due to the single f realization of the input, 
only affraction of the increment in (7.13a) should be irivolved in producing the 
sensible weight adjustments. It seems reasonable to reward the weights of the 
winning neuron with an increment of weight in the negative gradient direction, 
thus in the direction x — w m . We thus have 


Aw^ = a(\ - w m ) (7.13b) 

^ where a is a small leaming constant selected heuristically, usually between 0.1 
and 0.7. The remaining weight vectors w /? i ^ m, are left unaffected. Note that 
the rule is identical to Equation (2.46) introduced in Chapter 2 in the framework 
of leaming rules. ^ 

Using a superscript to index the weight updates and restating the update 
Vi** criterion (7.10), the leaming mle (7.13b) in the &’th step can be rewritten in a 
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more formal way as follows: 

^m +1 = + ak ( X ~ *„) (7.13C) 

w* +1 = w k , for i # m (7.1 3d) 

where a k is a suitable leaming constant and m is the number of the winning neu- 
ron selected based on the scalar product comparison as in (7.12). While leaming 
continues and clusters are developed, the network weights acquire similarity to 
input data within clusters. To prevent further unconstrained growth of weights, 
a is usually reduced monotonically and the leaming slows down. 

Leaming according to Equations (7.12) and (7.13) is called “winner-take- 
all” leaming, and it is a common competitive and unsupervised leaming tech- 
nique. The winning node with the largest net i is rewarded with a weight adjust- 
ment, while the weights of the others remain unaffected. pi< 

This mode of leaming is easy to implement as a computer simulation; one 
) merely searches for the maximum response and rewards the winning weights 
only. In a real network it is possible to implement a winner-take-all layer by 
using units with lateral inhibition to the other neurons in. the form of inhibitory w j 
connections. At the same time, the neuron should pcissess' cxciUUory connections 
to itself like in the MAXNET network. Since all the weights must, in addition 
to providing stable opration of the layer, be modifiable as a result of combined 
excitatory/inhibitory interactions, such layers may not be easy to develop as a 
physical neural network capable of meaningful leaming. 

Let us look at the impact of the leaming rule (7.13c) and (7.13d) on the 
performance of the network. The rule should increase the chances of winning 
by the m ' th neuron as in (7.12) for repetition of the same input pattem using 
the updated weights. If the requirement holds, then inequality (7.14a) should be 
valid for the new weights w'„ 

K x < (#„ + 

Using (7.13) we obtain 

Aw^x > 0 
or 

x'x - w^x > 0 (7.14c) 

which is equivalent to 

||x'|| • ||x|| cosO - ||wj • ||x|| COS(/r > 0 (7 14d) 

Assuming normalized vectors x = x reduces (7.14d) to 

1— cosi/r>0 (7.14e) 

where i/> = 4(w, x). Since (7.14e) is always tme, the winner-take-all leaming 
rule produces an update of the weight vector in the proper direction. 


(7.14a) 

(7.14b) 
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It is instructive to observe the geometrical interpretation of the rule. Con- 
sider that weights are represented as vectors in Figure 7.6(b). Assume that in 
this step the normalized input vector denoted as x and the vector w m yield the 
maximum scalar product w'x, for i — 1, 2, . . . , p. Next, a difference vector 
x — w m is created as shown. To implement the rule of (7.13c) and (7.13d) for 
x = x, an increment of the weight vector is computed as a fraction of x - w m . 
The result of weight adjustment in this training step is mainly the rotation of the 
weight vector w m toward the input vector without a significant length change. 
The adjusted weight vector results as w' m and is of a length below unity. To 
begin with the new training step, v?' m must be renormalized. Let us notice that 
another input belonging to the m’th cluster would make the vector w m even more 
representative of the cluster m. 

In the long term this leaming mode leads to the weight vectors that approx- 
imate^The'bnsemblesi of past winning input vectors. However, since the weights 
are adjusted Tn proportion to the number of events that end up with weight 
adjustments, this network reacts to the probability of occiirrence of inputs. In 
this context, the network may be used as a clustering network for the particular 
^ ^ probability of training vectors coming from each cluster. After the leaming is 
completed, each w z represents the centroid of an fth decision region, i = 1, 2, 
. . . , p, created in the n-dimensional space of the pattem data. On the other side, 
the network possesses an interesting feature sensitivity, which will be discussed 
later in this chapter in more detail. In summary, vectors w after training will 
become organized much like the set of example vectors x used for training. 

Note that the neurons’ activation function is of no relevance for this leaming 
mode. Figure 7.6(c) illustrates an example of three weight vectors of unity length 
that have acquired the direction of an average pattem cluster. The case shown 
is for three clusters and three-dimensional input pattems. It can be seen that 
the normalized weights approximate the centers of gravity of their respective 
clusters, which are represented each by a point on the surface of a unit sphere 
and marked by a cross. 

Another leaming extension is possible for this network when the proper class 
for some pattems is known a priori (Simpson 1990). Although this means that 
the encoding of data into weights is then becoming supervised, this information 
accelerates the leaming proce^s significantly. Weight adjustments are computed 
in the supervised mode as in (7.13b) and only for correct classifications. For 
^improper clustering responses of the network, the weight adjustment carries the 
opposite sign compared to formula (7.13b). We may thus notice that a > 0 for 
proper node responses, and a < 0 otherwise, in the supervised leaming mode 
for the Kohonen layer. 

Another modification of the winner-take-all leaming rule is that both the 
winners’ and losers’ weights are adjusted in proportion to their level of responses. 
This may be called/leaky competitive learning and should provide more subtle 
leaming in the cases for which clusters may be hard to distinguish. 


rSj" 
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Recall Mode 

The network trained in the winner-take-all mode responds instantaneously 
during feedforward recall at all p neuron outputs. The response is computed 
according to (7.9). Note that the layer now performs as a filter of the input 
vectors such that the largest output neuron is found as follows 

y m = maxCy,,^ y p ) 

and the input is identified as belonging to cluster m. In general, the neurons’ 
activation functions should be continuous in this network. For some applications, 
however, y m = 1 and = 0, i ± m , must be set in the recall mode of the 
clustering layer. In this way, for example, the weights of the following layer can 
be fanned out from the activated node of this previous layer while other nodes 
remain suppressed. 

Before a one-to-one vector-to-cluster mapping can be made after the network 
is trained in the unsupervised mode, it needs to be calibrated in a supervised 
environment. The calibration involves the teacher applying a sequence of p 
best matching class / cluster inputs and labeling the output nodes 1, 2, . .., p, 
respectively, according to their observed responses. Obviously, the calibrating 
labels assigned to the physical neurons of the layer would vary from training to 
training depending on the sequence of data within the training set, the training 
parameters, and the initial weights. Once the clustering network is labeled, it can 
perform as a cluster classifier in a local representation. 


Initialization of Weights 

As stated before, preferably random initial weight vectors would be used for 
this training. This indicates that initial weights should be uniformly distributed 
on the unity hypersphere in ^-dimensional pattem space. Self-organization of 
the network using the described training concept suffers from some limitations, 
however. One obvious deficiency related to a single-layer architecture is that 
linearly nonseparable pattems cannot be efficiently håndled by this network. The 
second limitation is that network training may not always be successful even for 
linearly separable pattems. Weights may get stuck in isolated regions without 
forming adequate clusters. In such cases the training must be reinitialized with 
new initial weights, or noise superimposed on weight vectors during the training. 
After the weights have been trained to provide coarse clustering, the leaming 
constant a should be reduced to produce finer adjustments. This often results in 
finer weight tuning within each cluster. 

One of the weight selection methods developed for training the network 
is called convex combination (Hecht-Nielsen 1987). In this method all weight 
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vectors are initialized at the value 

w° = -L[l 1 ... 1]', for i = 1, 2, ..., p 

\Jn 

The leaming starts at the weights as above and proceeds as in expression 
(7.13c) and (7.13d) with a very low a value. This forces the weight vectors at 
the beginning of leaming to be close to the input vectors and to have equal 
lengths. As leaming progresses, a is slowly increased. This allows for the grad- 
ual separation of weights according to the input clusters used for training. This 
procedure improves the chances for successful training, but does slow down the 
process. 

The simple winner-take-all training algorithm activates only one output neu- 
ron for each input vector; it thus provides local representation. The training 
approach can be modified, however, to a form in which K winners in the compe- 
tition are considered and awarded weight increases accordingly. Such leaming is 
called multiple-winner unsupervised leaming. The winning K neurons would be 
the ones best matching the input vector, instead of a single one as in the winner- 
take-all mode. The outputs of winning neurons can be set so that they sum to 
unity. This interpolation process usually leads to an increased mapping accuracy 
in applications using the winner-take-all layer as the first layer. A network using 
this approach is said to be operating in an interpolative mode (Hecht-Nielsen 
1987). 

EXAMPLE 7.2 


We present here an example of a two-cluster case and provide a simple 
geometrical interpretation of winner-take-all network training based on the 
criterion of Equation (7.12). The iterative weight adjustment is unsupervised. 
The pattems are from a normalized example training set: 


i 

0.8' 


0.1736' 


'0.707' 

i 

0.6 

7 

-0.9848 

7 

0.707 


0.342 ' 


0.6' 

-0.9397 
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(7.15a) 


The set of test vectors is shown in Figure 7.7(a). Two output neurons, p = 2, 
have been selected to identify two possible clusters. The normalized initial 
weights selected at random are 



T 

o 
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0 

>yr 2 = 

0, 


The inputs are submitted in ascending sequence from the set as in (7.15a) 
and recycled x 1? x 2 , . x 5 , x 1? x 2 — Let us evaluate the adjustments of 
weight vectors during leaming. The reader can verify that for a = Vi we 


408 


MATCHINQ AND SELF-ORGANIZING NETWORKS 



Figure 7.7 Competitive leaming network of Example 7.2: (a) training pattems and weight as- 
signments and (b) weight leaming, Steps 1 through 20. 


obtain from (7.13b) and (7.13c) after renormalization 



0.948' 
0.316 J ’ 



This is because the first neuron becomes the winner with x, as input. For 
convenience we convert the training set (7.15a) to the polar form as follows: 


{x 1 ,x 2 ,x 3 ,x 4 ,x 5 } = {14.36.87°, 14_- 80°, 14_45°, 
14.-70°, 14.53.13°} 


(7.15b) 


Since wj = 14-18.46°, it can be seen that Wj has moved from its initial po- 
sition 14_0° toward the potential first quadrant cluster. Note that the weights 
of the second neuron, w 2 , are unaffected in the first training step. The next 
training step with input x 2 produces = 14.-30.77°, and w} = 14.180°, 
since node 1 is again the winner. The iterative weight adjustments continue. 
Figure 7.7(b) shows the tabulated results of weight iterations for k — 20. 
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Analysis of two data clusters produced by the network indicates that 
x 1? x 3 , and x 5 belong to the first quadrant cluster with the center at 


Xj + x 3 + x 5 
3 


= H45° 


Similarly, x 2 and x 4 form another cluster in the fourth quadrant located at 


x 2 + x 4 
2 


14 “75° 


For large k , weight vectors tend to become stabilized as follows: the 
first neuron weights center around 1 445° and the second neuron weights 
center around 14-75°. The geometrical interpretation of weight vector dis- 
placement during leaming shown on the unity circle of Figure 7.7(b) also 
indicates that the leaming has been successful. Weight vectors w 1 , w 2 , ... 
are marked in the unity circle as points indexed 1, 2, 3, . . . 20. We may 
notice, however, that the first neuron won all of the first six competitions in 
a row and initially started representing both data clusters at the same time. 
Despite that initial takeover, during the second training cycle, the cluster 
containing x 2 and x 4 has been detected and identified by the second neuron. 
Its respective weights were later reinforced during cycles 3 and 4. Finally, 
the weight values of the second neuron were organized by acquiring the 
values of second cluster inputs. ■■ 


Separability Limitations 

We now see that winner-take-all leaming can successfully produce single- 
layer clustering/classifying networks. However, the networks will only be train- 
able using the criterion of Equation (7.12) if classes/ clusters of pattems are 
linearly separable from other classes by hyperplanes passing through origin. This 
is illustrated in Figure 7.8(a), which shows an example of three-class classification 
in the pattem space as generated by the trained network. 

In the case shown, three distinet clusters can be identified and classified. 
The weights responsible for clustering are marked Wj, w^, and w^. The scalar 
product metric-based clustering is impossible, however, for pattems distributed 
as shown in Figure 7.8(b). Since only a single neuron has the strongest response, 
the winner-take-all rule precludes that the same neuron would at the same time 
respond if pattems are on the negative side of the partitioning hyperplane. Adding 
trainable thresholds to neurons’ inputs may improve separability conditions. Even 
then, however, the two neuron network will not be able to leam the solution of 
an XOR problem, or similar linearly nonseparable tasks. An excessive number of 
neurons created in the winner-take-all layer could certainly also be of significance 
when leaming difficult clustering. 
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Figure 7.8 Separability illustration for a winner-take-all learning network (w f denotes the final 
weight vectors): (a) possible classification and (b) impossible classification. 


To ensure the separability of clusters with a priori unknown numbers of 
training clusters, the unsupervised training can be performed with an excessive 
number of neurons, which provides a certain separability safety margin. During 
the training, some neurons are likely not to develop their weights, and if their 
weights change chaotically, they will not be considered as indicative of clusters. 
Therefore, such weights can be omitted during the recall phase, since their outputs 
do not pro vide any essential clustering information. The weights of the remaining 
neurons should settie at values that are indicative of clusters. 


COUNTERPROPAGATION 

NETWORK 

The counterpropagation network is a two-layer network consisting of two 
feedforward layers. The network was introduced by Hecht-Nielsen (1987, 1988). 
In its simplest version, it is able to perform vector-to-vector mapping similar to 
^ heteroassociative memory networks. Compared to bidirectional associative mem- 
V V ory, there is no feedback and delay activated during the recall operation mode. 
The advantage of the counterpropagation network is that it can be trained to 
perform associative mappings much faster than a typical two-layer network. The 
counterpropagation network is useful in pattem mapping and associations, data 
compression, and classification. 
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Left- to -Right processing 



Right-to-Left processing 
(b) 

Figure 7.9 Counterpropagation network: (a) feedforward part and (b) full counterpropagation 
network, 

The network is essentially a partially self-organizing lobk-up table that maps 
IR” into R^ and is taught in response to a set of training examples. The objective 
of the counterpropagation network is to map input data vectors x, into bipolar 
binary responses z,-, for i = 1, 2, . p. We assume that data vectors can be 
arranged into p clusters, and that the training data are noisy versions of vectors 
Xp The essential part of the counterpropagation network structure is shown in 
Figure 7.9. It resembles, at first glance, the class of layered feedforward networks 
covered in Chapter 4. However, counterpropagation combines two different, novel 
leaming strategies, and neither of them is the gradient descent technique. The 
network’s recall operation is also different than for any previously seen architec- 
ture. 

The first layer of the network is the Kohonen layer, which is trained in the 
unsupervised winner-take-all mode described in the previous section. Each of the 
Kohonen layer neurons represents an input cluster, or pattem class, so if the layer 
works in local representation, this particular neuron’s input and response are the 
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largest. Similar input vectors belonging to the same cluster activate the same m’th 
neuron of the Kohonen layer among all p neurons available in this layer. Note 
that first-layer neurons are assumed with continuous aetivation function during 
leaming. However, during recall they respond with the binary unipolar values 0 
and 1. Specifically, when recalling with input representing a cluster, for example, 
m, the output vector y of the Kohonen layer becomes 


y\ yi ••• 


y? 


= [0 0 1 0 ] 


(7.16) 


Such a response can be generated as a result of lateral inhibitions within the layer 
which would need to be activated during recall in a physical system. The second 
layer is called the Grossberg layer due to its outstar leaming mode (Grossberg 
1974, 1982) as discussed in Chapter 2. The Grossberg layer, with weights v t j, 
functions in a familiar manner 


z = r[Vy] 


(7.17) 


with diagonal elements of the operator T being a sgn(*) function operating com- 
ponentwise on entries of the vector Vy. Let us denote the column vectors of 
the weight matrix V as v l5 v 2 , ...» v m , ..., \ p . Now, each weight vector \ m 
for i = 1, 2, . . . , /?, contains entries that are weights fanning out from the m’th 
neuron of the Kohonen layer. Substituting (7.16) into (7.17) results in 

z = r[vj (7.18) 


where \ m 


Vi« v 2w 



As can be seen, the operation of this layer with bipolar binary neurons is simply 
to output z, = 1 if v im > 0, and z i = —1 if v im < 0, for i = 1, 2, . . . , q. By 
assigning just any positive and negative values for weights v im highlighted in 
Figure 7.9, a desired vector-to- vector mapping x — » y — * z can be implemented 
by this architecture. This is done under the assumption that the Kohonen layer 
responds as expressed in (7.16). The target vector z for each cluster must be 
available for leaming so that the weight components of \ m can be appropriately 
made equal to + 1 or - 1 according to 


cite 


(7.19) 


However, this is a somewhat ovérsimplified weight leaming rule for this layer. 
This rule, which is of the batch type rather than incremental, would be appropri- 
ate if no statistical relationship existed between input and output vectors within 
the training pairs (x,z). In practice, such relationships often exist and they also 
would need to be established in the network during training. 

As discussed in the previous section, the training rule for the Kohonen layer 
has involved the adjustment of weight vectors in proportion to the probability 
of the occurrence and distribution of winning events. Using the outstar leaming 
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rule of Equ^tion (7.17) incrementally and not binarily^ as in (7.19) permits us 
to treat a c Stationary ådaitive noise in output z in d manner similar to the way 
we considered distributed clusters during the training of the Kohonen layer with 
“noisy” inputs. The outstar leaming rule makes use of the faet that the leaming 
of vector pairs, denoted by /he s set of mappings {(x^Zj), . v . ,(x p ,z p )}, will be 
doné gradually and thus involve eventual statistical balancing within the weight 
matrix V. The supervised leaming rule for this layer in such a case becomes 
ineremental and takes the form of the outstar leaming rule defined in (2.48): 

Av m = I 3(z - v m ) (7.20) 


where 0 is set to approximately 0.1 at the beginning of leaming and is gradually 
reduced during the training process (Wasserman 1989). Index m denotes the 
number of the winning neuron in the Kohonen layer. Vectors z h i = 1,2,..., p, 
used for training are stationary random process vectors with statistical properties 
/ that make the training plausible. Q 

' O v Note su P erv i se< ^ rule leaming according to (7.20) starts after 

u • f completion of the unsupervised training of the first layer. Also, as indicated, the 
weight of the Grossberg layer is adjusted if and only if it fans out from a winning 
neuron of the Kohonen layer. As training progresses, the weights of the second 
layer tend to converge to the average value of the desired outputs. Let us also 
note that the unsupervised training of the lirst layer produces active outputs at 
indeterminate positions^The ^segond layer introduces ordering in the mapping so 
' that the network becomes a desirable look-up memory table. During the normal 
recall mode, the Grossberg layer outputs weight values z = \ m connecting each 
output node to the winning first layer neuron. No processing, except for addition 
and sgn {net) computation, is performed by the output layer neurons if outputs 
are binary bipolar vectors. 

The network discussed and shown in Figure 7.9(a) is simply feedforward 
and does not refer to the counterflow of signals for which the original network 
was named. The full version of the counterpropagation network makes use of 
bidirectional signal flow. The entire^network consists of the doubled network 
from Figure 7.9(a). It can be simultaneously both trained and operated in the 
recall mode in an arrangement such as that shown in Figure 7.9(b). This makes 
it possible to use it as an autoassociator according to the formula 

' £]-[!$] (7 ' 21) 


Input signals generated by vector x, input, and by vector z, desired output, prop- 
agate through the bidirectional network in opposite directions. Vectors x' and z' 
are reSpecti ve outputs that are intended to be approximations, or autoassociations, 
of x and z, respectively. 

Let us summarize the main features of this architecture in its simple feed- 
forward version. The counterpropagation network funetions in the recall mode 
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as a nearest match look-up table. The input vector x finds the weight vector 
w m that is its closest match among p vectors available in the first layer. Then 
the weights that are entries of vector v w , and are fanning out from the winning 
ra’th Kohonen’s neuron, after sgn(-) computation, become binary outputs. Due to 
the speciflc training of the counterpropagation network, it outputs the statistical 
averages of vector z associated with input x. In practice, the network performs 
as well as a look-up table can do to approximate vector matching. 

Counterpropagation can also be used as a continuous function approximator. 
Assume that the training pairs are (x /5 z ,) and z t = g(\ t ), where g is a continuous 
function on the set of input vectors {x}. The mean square error of approximation 
can be made as small as desired by choosing a sufficiently large number p of 
Kohonen’s layer neurons. However, for the continuous function approximation, 
the network is not as efficient as error back-propagation trained net works, since 
it requires many more neurons for comparable“accuracy (Hecht-Nielsen 1990). 
Counterpropagation networks can be used for rapid prototyping of a mapping and v 
to speed up system development, since they typically require orders of magnitude.> ; 
fewer training cycles than is usually needed in error back-propagation training. ov/;) 

The counterpropagation can use a modified competitive training condition "' 
for the Kohonen layer. Thus far we have assumed that the winning neuron, 
for which weights are adjusted as in (7.13c) and (7.13d), is the one fulfilling 
condition (7.12) of yielding the maximum scalar product of the weights and the 
training pattem vector. Another alternative for training is to choose the winning 
neuron of the Kohonen layer such that instead of (7.12), the minimum distance 
criterion (7.10) is used directly according to the formula 

II* - w mll = . min {||x - w,||} (7.22) 

i=l,2,.. . y p 

The remaining aspects of weight adaptation and of the training and recall mode 
would remain as described in the previous section for criterion (7.12). The only 
difference is that the weights do not have to be renormalized after each step in 
this training procedure. 


FEATURE MAPPING 

In this section we focus on the feature mapping capabilities qf neural 
networks. Special attention is devoted to mappings that can be arranged in ge- 
ometrically regular self-organizing arrays. As discussed in Section 3.1, feature 
extraction is an important task both for classification or recognition and is often 
necessary as a preprocessing stage of data. In this way data can be transformed 
from high-dimensional pattem space to low-dimensional feature space. Our goal 
in this section is to identify neural architectures that can leam feature mapping 
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Figure 7.10 Pattern structure: (a) naturat similarity and (b) no natural similarity. 


without supervision. Before introducing such architectures we will first reexamine 
the concept of features and their extraction. 

The pattems in the pattem space can be distributed in two different struc- 
tures. Although it is hard to define these precisely, we can say the that pattems 
in Figure 7.10(a) have a certain “natural” structure while the pattems in Figure 
7.10(b) do not. The natural structure is often related to pattem clustering or their 
mutual location in the original pattem space. The natural pattem structure appar- 
ently makes the perception of them by a human observer easier. In contrast to 
the harmonious perception of such classes that show common features, a chaotic 
mixture of pattems is much more difficult to separate, memorize or classify in 
any way. The basic difference in identifying “natural” versus “unnatural” pat- 
tems structure is that the pattem space may be similar to our human perception 
space, or they may be different. The mismatch between the pattem space and 
our perception space is causing the apparent lack of pattem similarity or lack of 
their natural stmcture. 

In our discussion we are interested in two aspects of mapping features. 
It is important to reduce the dimensionality of vectors in pattem space when 
representing these vectors in feature space. To facilitate perception, it is equally 
important to provide as natural a stmcture of features as possible. We are inter- 
ested in easy perception of low-dimension pattems no matter how “unnatural” 
their distribution may be in the original pattem space. 

Some pattems exhibit inherently smaller dimensionality that can be per- 
ceived by superficial inspection. The inherent dimensionality reduction results 
from the explicit relationship that may exist between variables or the correlation 
between data. The number of degrees of freedom in such relationships can be 
lower than perceived initially. Figure 7.11 shows an example of constrained free- 
dom pattem data. It depicts the longitudinal displacement d of a point moving 
with constant acceleration a. Assume that the pattem space is defined as speed 
v and distance d from the origin. Since the movement parameters are related 
as d - v 2 / (2a), the d(v), motion equation, can be considered one-dimensional 
in pattem space v, d. This limitation of dimensionality is due to the physical 
restriction of this motion on the number of degrees of freedom. Most often we 
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Figure 7.11 Reduction of degree of freedom: (a) movement parameters versus time and (b) 
pattern space. 


are interested in reducing the dimensionality of pattems in feature space in more 
involved cases when no analytical relationships such as the one documented in 
Figure 7.11 can be established. 

Pattems in multidimensional space, which often can be clustered in one-, 
two- or three-dimensional feature space, may have a much more complicated 
structure in the original pattem space. Our goal is to find a self-organizing neural 
array that consists of neurons arranged in a virtual one-dimensional segment or 
in a two-dimensional array. Such an array is typically a rectangle or triangle, or a 
spatial lattice-type connection of neurons within a three-dimensional cuboid. The 
neurons of such an array are connected to the input vector as shown in Figure 
7.12(a) for an example two-dimensional array. Note that each component in the 
input vector x is connected to each of the nodes, and the weight w ( j transmits 
the input Xj toward the fth node of the feature array. 




(a) (b) 

Figure 7.12 Mapping features of input x into a rectangular array of neurons: (a) general 
diagram and (b) desirable response peaks for Figure 7.1 0(b). 
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Figure 7.13 Planar pattern to planar feature space mapping: (a) 64-dimensional excitation in 
pattern space and (b) translation of intercept and angle to planar feature array. [Adapted from 
Tattershall (1989). © Chapman & Hall; with permission.] 


The output y t of a neuron within the array is a function of similarity S 
between the input vector x and w f - defined as 

y, = /(S(x,w ; )) (7.23) 

We studied in Section 7.2 a simplified version of the unsupervised training of 
clusters using as the scalar product a similarity metric. Before we exploit similar- 
ity metrics any further, let us look at certain interesting aspects of dimensionality 
reduction. 

It seems desirable to filter the input data onto feature space so that the 
pattems in the image space, as shown in Figure 7.10(b), are mapped into one 
of the easy to perceive clustering formats. An acceptable arrangement would be 
that of the leftmost square of Figure 7.10(a). Using the extraction of features ar- 
chitecture of Figure 7.12(a), the planar layer of neurons, called the feature array , 
should possibly display distinet peaks when excited with original pattems. Thus, 
inputs from classes 1, 2, and 3 should preferably result in perceptually separated 
excitation peaks as shown in Figure 7.12(b). In the example case illustrated, we 
map a two-dimensional pattem space into a two-dimensional feature space. 

A question arises about the dimensionality of the original planar data. As- 
sume that a number of pattem points are given on the plane or in two-dimensional 
pattem space. A computational experiment has been designed in which a two- 
dimensional top array of neurons has been trained using planar input data from 
the bottom plane (Tattershall 1989). As shown in Figure 7.13(a), the input pat- 
tems’ dimensionality has been made equal to the number of actual planar inputs. 
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In the experiment reported, planar data coded as 64-dimensional input vectors 
from the bottom plane have been used for training of the 8 X 8 top planar neuron 
layer shown. 

After training, the single excitations of point inputs highlighted on the figure 
have been successfully mapped into single peaks of neuron responses at positions 
directly above the excitations. Thus, the one-to-one geometrical mapping of peak 
responses generated just above excitations indicates that the dimensionality of 
any pattem data taken from a two-dimensional Space is actually only two. The 
dimensionality of the pattem representation has no significance. Therefore, the 
described projection of planar data always results in two-dimensional images. 

In a related experiment (Tattershall 1989) illustrated in Figure 7.13(b), a 
straight line within a square 10 X 10 pixel field has been mapped into a 100-neuron 
feature array arranged in a square. Since the line has 2 degrees of freedom, two 
line parameters, which are the intercept with axis and slope, have been produced 
by the top network trained with 100-dimensional input vectors. The application 
of a large number of training lines encoded as 100-tuple input vectors causes the 
feature array to self-organize. Subsequently, an input line forces the output to peak 
at the corresponding coordinates of intercept and angle. Obviously, calibration 
of line parameters on the top neural array would also be required after training. 
The parameter readings can then be made for the performed mapping. 

In the following discussion we will look at the details of how regularly 
shaped neuron layers can arrange and regularize the data from the original pat- 
tem space based on a similarity metric. Mappings will be discussed that are at the 
same time able to preserve pattem space relationships and to reduce the dimen- 
sionality of the representation space. The key concept here is the self-organization 
within a regular array of neurons. This self-organization is demonstrated below 
based on the lateral feedback concept. 

Let us begin by looking for clues about self-organization that may exist 
in living biological systems. Many biological neural networks in the brain are 
found to be essentially two-dimensional layers of processing neurons densely 
interconnected by lateral feedback sy napses as shown in Figure 7.14(a). The 
figure shows a linear array, which is a cross section of a planar array cutting 
through the neuron i. The biological neurons of the layer have lateral connection 
strengths that depend on the distance as illustrated in Figure 7.14(b). The imme- 
diate neighborhood of the 50- to 100- fim radius is characterized by short-range 
lateral excitatory connections. The self-feedback produced by each biological 
neuron connecting to itself is positive. It is expressed by the feedback coefficient 
value y > 0 as shown in the figure. The excitatory area is surrounded by a ring 
of weaker inhibitory connections of 200- to 500- fim radius (Kohonen 1984). The 
feedback connections’ weights are shaped like a Mexican hat. It is therefore often 
mentioned in the literature that the layer Figure 7.14(a) displays the “Mexican 
hat” type of lateral interaction. 
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Figure 7.14 Lateral connections for the clustering demonstration: (a) interconnections of neu- 
rons and (b) lateral- and self-feedback strength. 

_____ 


It is interesting to observe the response of a planar array of neurons with 
lateral connections. A two-dimensional layer of neurons responds with an “ac- 
tivity bubble” produced at a location where the input is maximum. The center 
of the “activity bubble” is the center of the excitation, too. The typical planar 
array response pattem is shown in Figure 7.15 displaying circular array s of ac- 
tivated neurons. Reducing positive lateral feedback of individual neurons results 



Figure 7.15 Planar activity formation for various strengths of lateral interaction: (a) strong 
positive feedback and (b) weak positive feedback. [Adapted from Kohonen (1984). © Springer 
Verlag; with permission.] 
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in a smaller radius for an activated array. This can be seen from comparison 
of Figures 7.15(a) and (b). These phenomena are due to the self-organization 
process. The example below explains and illustrates the observed phenomena in 
more detail. 


EXAMPLE 7.3 


Let us look at the self-organization of a linear array of n neurons arranged 
as shown in Figure 7.16(a). The linear array, or row, of neurons can be 
thought of as representing an arbitrary cross section through the center of 
an activity bubble of the planar array, examples of which are presented in 
Figure 7.15. 

The response of the fth neuron in a linear array can be expressed as 

yf+tWr*') (7-24) 


ko 


y t (t + 1) -/ x t (t + 1) + £ 


k=-k 0 


The recursive form of Equation (7.24) is due to the small delay in feeding 
signals y(f) back to the input; the delay value has been normalized here to 
unity for computational convenience. The feedback coefficients y k are shown 
in Figure 7.16(b) as a function of intemeuronal distance. The coefficients 
represent a discretized feedback strength function that can be thought of as 
an envelope for y k . They are represented as a function of discrete variable k 
denoting the linear distance rather than as a lateral feedback function defined 
versus continuous lateral distance. Notice that the neuron’s index refers now 
to the neuron’s location in a row. The excitation function for the experiment 
has been arbitrarily selected with maxima for the fifth and sixth neurons 
and is defined below in (7.25a). This choice of excitation produces peak 
local activity formation for the spatial neighborhoods of neurons 5 and 6. 
Conditions for the simulation experiments are chosen as 

" tt {i + 3)1 


jq(0 = 0.5 sin 


17 


for i = 1, 2, 


10 


(7.25a) 


The neuron’s activation function and positive and negative feedback coeffi- 
cients are, respectively, 

{ 0, net < 0 

net, 0 < net < 2 (7.25b) 

2, net ^ 2 

b = 0.4, c = 0.2 


Fifteen recursive computational steps (7.24) have been performed for the 
row of 10 neurons. The results are listed in Figure 7.17(a). The first row 
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Figure 7.16 Formation of activity in cross section: (a) one-dimensional array of neurons and (b) 
example lateral feedback. 


of the table contains excitation values as in (7.25a). It can be seen that 
neurons 5 and 6 formed peak and saturated responses after 15 recursions. 
They attained a constant level of 2 which is the maximum response of the 
node. The responses of neurons 1, 2, 3, 8, 9, and 10 have been laterally 
attenuated to zero for all cases studied. The response formation took place 
gradually over the iterations, as can be seen by comparing rows of the table 
and by inspecting the four curves shown. 

The impact of the lateral feedback coefficients b and c on the response 
of the linear array is shown in Figure 7.17(b). It can be seen that increasing 
the positive feedback coefficients from 0.4 to 0.5 substantially widens the 
radius of maximum response. Increase of negative feedback coefficients from 
0.2 to 0.25 narrows the transition activity regions between the maximum and 
normal response. Values of c that are too small can also increase the region 
of maximum responses. m 

We have observed how activity clusters within contiguous regions are gen- 
erated in a linear array of neurons. The observations and conclusions from this 
computational experiment can be helpful in formalizing an algorithm that would 
implement meaningful unsupervised leaming within the self-organizing array. 
Since the center of activity is at the center of excitation, the weight adaptation 
algorithm should find the point of maximum activity and activate its respec- 
tive spatial neighborhood. In this way adaptive weight vectors can be tuned to 
the input variable. Although we only considered an example of the linear ar- 
ray, a simple generalization provides the formation of localized activity to more 
dimensions. 
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Figure 7.17a Clustering of activity within a cross section: (a) neuron outputs for Steps 0 
through 15. 
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Figure 7.17b Clustering of activity within a cross section (continued): (b) steady-state activity 
for different feedback coefficients. 


7.5 

SELF-ORGANIZING 
FEATURE MAPS 

The feature mapping algorithm is supposed to convert pattems of arbitrary 
dimensionality into the responses of one- or two-dimensional arrays of neurons. 
The time-domain recursion illustrated in Example 7.3 needs to be replaced by a 
self-organizing weight leaming rule. 

First we discuss simple one-dimensional mapping. Assume that a set of 
input pattems {*,, i = 1,2,...} has been arranged in an ordering relation with 
respect to a single feature within the set. One of the simplest systems thought 
to produce localized responses is a linear array of neurons that receive the same 
set of input signals in parallel. The linear array system is defined by Kohonen 
(1982) as the one producing one-dimensional topology-preserving mapping for 
ii > i 2 > i 3 . . . when 


y n = max{y,C*i), 

l 

i = 1, 2, . 


yn = max{y,.(x 2 ), 

l 

i =1,2,. 

... n} 

y« 3 = max{y,.(x 3 ), 

i = 1, 2, . 

«} 


(7.26) 



424 


MATCHING AND SELF-ORGANIZING NETWORKS 


NM- - - 


o/o o o o o o o\o o o o 
, fo o/o o o o\o c\o o o o 

N(L) — — — —7 \ \ 

'o 0/0 000 o\o o\o o o 

N t,)-/- -°J9. 9/° °\° °\° °\° 0 0 
mh h oto 0/0 * m o\o c)o 0)0 o 

o\o o \o 0/ 0 0/0 0/0 o o 
\o o\o 000 0/0 0/0 o o 
\o o\o o o 0/0 0/0 OOO 

Winning neuron is m 


Figure 7.18 Topological neighborhood definition, t| < fe < fe. 


This definition can be generalized for two or three dimensions. The topological 
ordering of output is understood in any mapping case as the one preserving the 
neighborhood. The input ordering need not be specific and can follow arbitrary 
metrics and orders. Also, the dimensionality of the input Space vectors is not 
restricted. The basic network on which most of the discussion in this section is 
based is shown in Figure 7.12(a) with feature mapping from an original input 
space into a two-dimensional rectangular neuron array. 

Leaming within self-organizing feature maps results in tinding the best 
matching neuron cells which also activate their spatial neighbors to react to 
the same input. Such collective and cooperative leaming tunes the network in an 
orderly fashion by defining some feature coordinates over the trained network. 
After leaming, each input causes a localized response having a position on the 
neurons’ array that reflects the dominant feature characteristics of the input. 

The feature mapping can be thought of as a nonlinear projection of the 
input pattem space on the neurons’ array that represents features. The projection 
makes the topological neighborhood relationship geometrically explicit in low- 
dimensional feature space. Following the formation of self-organized intemal 
representation for existing implicit relationships between original data, the same 
implicit input space relationships become explicit on the spatial neuron map. 

The self-organizing feature map algorithm outlined below has evolved as a 
result of a long series of simulations. The rigorous mathematical analysis of the 
dynamics of the self-organization algorithm and the algorithm ’s proof remain yet 
to be discovered (Kohonen 1990). Thus, our focus in this section is on formula- 
tion of the algorithm, its intpitive interpretation, and sample applications. 

The algorithm is explained below for a planar array of neurons with hexag- 
onal neighborhoods (Kohonen 1990) as shown in Figure 7.18. As illustrated in 
Figure 7.12(a), input x is applied simultaneously to all nodes. Instead of using the 
scalar product metric of similarity, the spatial neighborhood N m is used here as 
a more adequate measure S of similarity between x and w,. The weights affect- 
ing the currently winning neighborhood, N m , undergo adaptation at the current 
leaming step, other weights remain unaffected. The neighborhood N m is found 
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around the best matching node m selected such that 

■II* - w mll = min{||x - w, ||} (7.27) 

The radius of N m should be decreasing as the training progresses, N m (t x ) > 
N m (t 2 ) > N m (t 3 ) .... where t x < t 2 < t 3 — The radius can be very large as 
leaming starts, since it may be needed for initial global ordering of weights. 
The local ordering within the global order would gradually follow thereafter. 
Toward the end of training, the neighborhood may involve no cells other than 
the central winning one. As we realize, this leaming neighborhood eventually 
becomes identical to the neighborhood of the simple competitive leaming rule 
of Kohonen’s layer used for cluster leaming covered in Section 7.2. 

The reader can notice that no explicit lateral interactions of the Mexican hat 
type are present in the weight adjustment algorithm. However, the mechanism of 
lateral interaction is embedded in the definition of localized neighborhoods and 
in the subsequent weight adaptation within the neighborhood. 

The weight updating rule for self-organizing feature maps is defined as 

Aw,(r) = a [x(0 - w,(t)] for i e N m (t) (7.28a) 

where N m (t) denotes the current spatial neighborhood. Since leaming constant a 
depends both on training time and the size of the neighborhood, (7.28a) can be 
rewritten in a more detailed expression as follows: 

Aw,(0 = a(N h t ) [x(0 - w,(0] for i e N m (t) (7.28b) 

where a is a positive-valued leaming function, 0 < a(N h t) < 1. Because a 
needs to decrease as leaming progresses, it is often convenient to express it as a 
decreasing function of time. In addition, a can be expressed as a function of the 
neighborhood radius. An example of a function yielding good practical results 
in a series of simulations (Kohonen 1990) is 

a(Ni,t) = a(t) exp -||r,- - r m ||/<r 2 (o] (7.29) 

where r m and r, are the position vectors of the winning cell and of the win- 
ning neighborhood nodes, respectively, and a{t) and cr{t) are suitably decreasing 
functions of leaming time t . 

It seems that the main conditions for the self-organization of the Kohonen’s 
feature map are as follows: (1) The neurons are exposed to a sufficient number 
of inputs; (2) only the weights leading to an excited neighborhood of the map 
are affected; and (3) the adjustment is in proportion to the activation received 
by each neuron within the neighborhood. As a result, the weight adaptation rule 
tends to enhance the same responses to a sufficiently similar subsequent input. 

As a result of weight adjustment algorithm implementation, a planar neuron 
map is obtained with weights coding the stationary probability density function 
p(x) of the pattem vectors used for training. The ordered image of pattems forms 
the weights w t j. 
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(b) (c) 

Figure 7.19 Self-organizing feature mapping example: (a) list of patterns, (b) feature map pro- 
duced after training, and (c) minimum spanning tree. [from Kohonen (1984). © Springer Verlag; 
reprinted with permission.] 


An interesting example of the mapping of five-dimensional data vectors is 
shown in Figure 7.19 (Kohonen 1984). Figure 7.19(a) lists 32 different five- 
dimensional input vectors labeled A through 6. The rectangular array of features 
consists of 70 neurons, each connected by Ave weights with pattem components 
x l9 x 2 , x 3 , jc 4 , and x 5 . The array has been trained using vectors x A ...x 6 selected at 
random from the training set. After 10,000 training steps the weights stabilized 
and then the obtained network was calibrated. Calibration was performed by 
supervised labeling of array neurons in response to a specific known vector from 
the training set. Thus, when vector B was input, the upper left comer neuron 
from Figure 7.19(b) produced the strongest response in the entire array. It was 
therefore labeled B , etc. As seen from the figure, 32 neurons are provided with 
labels and 38 neurons remained uncommitted. Let us analyze the meaning of the 
obtained map. 

Inspection of the similarity of the training vectors reveals that vectors 
Xj . . . x 6 can be arranged in a so-called minimum spanning tree as illustrated 
in Figure 7.19(c). The minimum spanning tree is a technique for investigating 
clustering or neighborhoods of data (Andrews 1972). The approach results from 
a graph theoretical analysis of an arbitrary data set. The minimum spanning tree 
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is defined as a planar tree connecting all points in the data set to their closest 
neighbors such that the total tree length is minimized. Each vertex of the graph 
corresponds to only one element from the data set. If we assign a number to 
each graph edge that is equal to the distance between the vertices that the edge 
connects, then the minimum spanning tree of a set is that spanning tree which 
has the sum of the edges of the tree of minimum value. 

By inspecting the data x A ...x 6 , the distance between x A and x B is deter- 
mined to be 

\\*a ~ x bII = 1 < K “ X/IU for i = C, . . . , 6 

Thus the edge connecting A and B should probably belong to the minimum 
spanning tree. Further, since we have 

ll x s - *cll = 1 < ll x s - x ;ll> for / = 6 


then the edge connecting B and C will probably be part of the tree. Since the 
distances both between x c and x D , and x c and x F are both of unity value, the 
tree would be likely to include the edges CD and CF. The reader can continue 
generating the minimum spanning tree by inspecting distances between the data 
and growing branch after branch of the planted tree. 

The feature map produced through self-organization has the same structure 
as the minimum spanning tree. Each piece of data corresponds to a single neuron 
cell, namely to the best matching one. Thus, the neuron represents its image on 
the feature array. The data at minimum distances in original five-dimensional 
space are arranged into topological neighborhoods in two-dimensional feature 
space. The distances between the vectors x a and x^, defined in the original 
pattem space as the norm 


Xfl = 



are preserved in the feature space where they become planar distances. Planar 
distances are perceptually very pervasive when compared with five-dimensional 
space distances. As can be seen from the map, distances within multidimensional 
space between 32 data items are now flattened on the plane. By connecting the 
topological neighbors on the plane, we obtain the minimum spanning tree. The 
obtained map of features is somewhat squeezed into the minimum space, but the 
topological relations are preserved on the map with some parts of the minimum 
spanning tree lightly bent on the map. 


EXAMPLE 7.4 


In this example, a linear array of 10 neurons is trained using one-dimensional 
data x distributed uniformly between 0 and 1. The data with equal probability 
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Figure 7.20a, b Linear array for Example 7.4: (a) network diagram, (b) weights after training, 10 
experiments. 


of occurrence are thus mapped from one-dimensional pattem space to ten- 
neuron, one-dimensional linear feature space. 

The network arrangement is shown in Figure 7.20(a). Initial weights are 
random and centered around 0.5 with 0.05 radius. The initial neighborhood 
radius of one node has been reduced to zero nodes after 300 training steps. 
The results of 1000 iterations are tabulated in Figure 7.20(b). Each of the 10 
rows lists final weights. Each of the 10 computational training experiments 
started at different initial values and was performed with data submitted in 
different random sequences. As can be seen, the initially unordered weights 
can be ordered to become approximately linear. This can be done by cali- 
brating the trained network through applying known inputs between 0 and 1 
that are equally spaced in this range. The average value for a weight matrix 
obtained in the series of 10 simulations can be computed as 

W = ^L[6 17 28 39 47 57 66 75 85 95]' 

The weights obtained in the experiment are close to the empirical cumulative 
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Figure 7.20c Linear array for Example 7.4 (continued): (c) weight values for an ordered linear 
array. 


distribution function of a uniform sample of numbers from the range [0,1]. 
Figure 7.20(c) shows that weights become similarly organized to the training 
pattems. h 

When applying the self-organizing feature mapping algorithm, formulas 
(7.27) and (7.28) altemate. Input x, even if deterministic and defined, is treated 
as having a probability density of occurrence. The weight adjustments under 
Kohonen’s algorithm reflect this probability of occurrence. In faet, this leaming 
algorithm summarizes the history of the input data along with their topological 
relationships in the original pattem space. Moreover, in real-world training situ- 
ations, input x should involve pattems in their natural sequence of occurrence. 

Numerous technical reports have been written about successful applications 
of the feature map algorithm (Kohonen 1984). The algorithm has been applied 
in sensory mapping, robot control, vector quantization, and speech recognition. 
One explanatory example of feature mapping is shown in Figure 7.21. The figure 
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Figure 7.21 Self-organizing feature map for two input nodes, 25 x 25 array of neurons, and 
uniformly distributed excitation [from Kohonen (1990). © IEEE; reprinted with permission.] 


shows the initial and intermediate phases of self-organization of a square feature 
map for a two-dimensional input vector. The training has been performed on 
random independent inputs uniformly distributed over the unity square shown 
in each of the six boxes. In faet, each of the boxes displays the distribution 
of weights during training. The initial weights are selected to be random and 
close to the values of 0.5 and 0.5. The final weights, w n and w /2 , of the square 
array are shown after 100 000 training iterations in the last of the six boxes. 
The horizontal and vertical coordinates of the plots are the resulting values 
w n and w i2 , respectively, for an array of 25 X 25 neurons. Line intersections 
of the maps specify weight values for a single i’th neuron. Lines between the 
nodes on the graph merely connect weight points for neurons that are topological 
nearest neighbors. 

In practical applications of self-organizing feature maps, input vectors are 
often high dimensional. In speech recognition, for example, there are typically 
15-tuple input pattem vectors. During the initial processing of the speech signal, 
the microphone signal is first converted into a 15-channel fast Fourier trans- 
form (FFT) spectral signal covering the range of frequencies from 200 hertz 
to 5 kilohertz. Each phoneme of speech tends to occupy one segment of the 
15-dimensional spectral signal Space. Although spectra of different phonemes par- 
tially overlap and 100 percent accurate discrimination is not completely achiev- 
able by any method of phoneme classification, an array of properly trained 
neurons is able to extract most phoneme features and project them on the plane 
as described below. 

The planar neural array has been trained on spectral signals generated every 
9.83 ms during the continuous speech (Kohonen 1988). The spectral FFT samples 
have been used for the weight adaptation algorithm (7.27) and (7.28) in their 
natural order of occurrence. During training, various neurons become sensitized 
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Figure 7.22 Speech phoneme map after training. [from Kohonen (1990). © IEEE; reprinted 
with permission.] 


to spectra of different phonemes. As a result, the trained feature array shown in 
Figure 7.22 has been generated. The figure displays a phoneme map of natural 
Finnish speech. The initially unlabeled map was subsequently calibrated using 
standard reference phoneme spectra. As can be seen, most cells that leamed a 
phoneme have acquired a unique phoneme characterization. Some cells indicate 
responses to two different phonemes. Discrimination of overlapping phonemes 
k , /?, and t has not worked too reliably and would need an additional phoneme 
map discrimination. 

The map facilitates the planar visualization of complex multidimensional 
waveforms of speech. The visualization of speech phonemes may be useful for 
speech therapy or training, but the ultimate goal of speech phoneme process- 
ing is automatic speech recognition. Let us note that Figure 7.22, showing the 
phoneme map, can be thought of as the keyboard of a “phonetic” typewriter 
(Kohonen 1988). The reader may also refer to Figure 1.11 showing visualization 
of continuous speech signals on phoneme maps in the form of word trajectories. 
Generation of word trajectories points out another very interesting application of 
Kohonen ’s feature map. 

It is often reported in the technical literature that the self-organizing map’s 
successful training depends on the initial weights and the selection of training 
parameters. Several practical conditions are of importance for the user attempting 
the implementation of the algorithm. They are discussed below (Kohonen 1990). 

The number of weight adjustment steps should be expected to be reasonably 
large since the leaming here is a statistical process. A rule of thumb is that, for 
good statistical accuracy, the number of steps should be at least 500 times larger 
than the number of array neurons. Note that although 100 000 training steps are 
not uncommon, the algorithm is rather simple computationally. Also, intermediate 
results may be recycled, in particular when the set of training samples is not large. 

For approximately 1000 initial steps, the leaming function a(t) in (7.29) 
should be kept close to unity, then it should start decreasing at t = t p . Linear or 
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exponential decreases are exemplified below. The leaming function a(t) can be 
specified for the discussed choices as follows: 


a(t) = a 0 . 

for t < t p 

(7.30a) 

and for t > t p we have linear decrease 

l 

t - t„\ 


II 

o R 

*- </) 

(7.30b) 

or, altematively, exponential decrease 



a(t) = a 0 exp 

k-t-t p )/t q ] 

(7.30c) 


where t q is the time length responsible for the rate of a{t) decay. Its value should 
be suitably chosen by the user. After the initial map ordering for large a values, 
a(t) should be kept to a small positive value in the range of 0.01 over an extended 
period of training. This would suggest that a decelerated decay of a originally 
chosen as in (7.30b) and (7.30c) may be suitable in many cases. 

The criteria for selection of the controlled neighborhood radius was intro- 
duced in (7.29). At the beginning of training, the radius of the neighborhood can 
cover the entire network. After the gradual reduction of the neighborhood radius, 
the network should be continuously trained in the fine weight adjustment mode 
with nearest neighbors only. Then, the neighborhood should shrink to zero and 
only include the neuron itself. 


CLUSTER DISCOVERY 
NETWORK (ÅRTI) 

The network covered in this section was developed by Carpenter and Gross- 
berg (1987, 1988) and is called an adaptive resonance theory 1 (ÅRTI) network. 
It serves the purpose of cluster discovery. Similar to networks using a single 
Kohonen layer with competitive leaming neurons covered in Section 7.2, this 
network leams clusters in an unsupervised mode. The novel property of the 
ÅRTI network is the controlled discovery of clusters. In addition, the ÅRTI 
network can accommodate new clusters without affecting the storage or recall 
capabilities for clusters already leamed. 

The; network produces clusters by itself, if such clusters are identified in in- 
put data, and stores the clustering information about pattems or features without 
a priori information about the possible number and type of clusters. Essentially 
the network “follows the leader” after it originates the first cluster with the first 
input pattem received. It then creates the second cluster if the distance of the 
second pattem exceeds a certain threshold, otherwise the pattem is clustered with 
the first cluster. This process of pattem inspection followed by either new cluster 
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y, y-, Ym 



(Only some top-down weights shown) 

Figure 7.23 Network for discovering clusters (elements computing norms for the vigilance test 
and elements performing the vigilance test and disabling yj are not shown). 

origination or acceptance of the pattem to the old cluster is the main step of 
ÅRTI network production. A more detailed description of the training and its 
algorithm is given below. 

The central part of the ÅRTI network computes the matching score reflecting 
the degree of similarity of the present input to the previously encoded clusters. 
This is done by the topmost layer of the network in Figure 7.23 performing 
bottom-to-top processing. This part of the network is functionally identical to 
the Hamming network and MAXNET from Figure 7.1(a). The initializing input 
to the m’th node of MAXNET is the familiar scalar product similarity measure 
between the input x and the vector w m . We thus have the initial matching scores 
of values 

y° m = w m x, for m = 1, 2, .... Af (7.31) 

where w OT = [ w lm w 2m . . . w nm ] \ Note that the double subscript convention 
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for weights w ( j in this section (and local for only this section of the text) is not 
followed. The first weight index denotes input node number “from,” the second 
index denotes node number “to.” This is to conform with common notation used 
in the technical literature for this network. The activation function f(net) for 
the MAXNET neuron is shown in Figure 7.3(b) and given by (7.5b). Jt is also 
assumed that a unity delay element stores each MAXNET neuron output signal 
during the unity time A during recursions, before it arrives back at the top-layer 
node input. The input of the topmost layer is initialized with vector y°, entries of 
which are computed as matching scores (7.31), and thereafter the layer undergoes 
recurrent updates as expressed in (7.5a). We thus have for this portion of the 
network 

y* +1 = r[W M y*] (7.32) 

where the weight matrix W M is defined as in (7.4) and fulfills all related con- 
ditions of the MAXNET weight matrix as stated in Section 7.1. The initializing 
matching scores vector y° = net 0 for (7.32) and for the top-layer recurrences is 
given by the simple feedforward mapping 

y° = Wx (7.33) 


where W is the bottom-to-top processing weight matrix containing entries w tJ as 
follows 


w u 

w 2l ■ 

■■ W nl 

W 12 

W 22 ■ 

■■ W n2 

Wl M 

™2M ■ 

■■ W nM 


As for the MAXNET network, the single nonzero output for a large enough re- 
cursion index k , y* +1 , is produced by the /th top-layer neuron. For this winning 
neuron we have 



n 


I w ,j x < = 


max 



(7.34) 


The above discussion is pertinent to the matching score computation of the cluster 
discovery algorithm. Formulas (7.32), (7.33), and (7.34) are used for calculation 
of the subscript associated with the largest value y^, m = 1, 2, . . . , Af. This 
is equivalent to the search for the winning node of the top layer. After recur- 
sions in the top layer have stabilized, the number of the node with the nonzero 
output, say j, is the potential cluster number. As we have realized by now, this 
part of computation also provides the identification of the weight vector w j that 
best matches the input vector x of all vectors Wj, w 2 , ...» w M , in the scalar 
product metric. 

The top-down part of the network checks the similarity of the candidate 
cluster with the stored cluster reference data and performs the vigilance test 
on normalized v ij x i entries as described below. The vigilance threshold S, 
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Update W, V 
for the cluster 


Yes 


Figure 7.24 Flowchart of the ÅRTI encoding algorithm for unipoiar binary inputs. 



0 < S < 1, sets the degree of required similarity, or “match,” between a cluster 
or pattem already stored in the ÅRTI network and the current input in order 
for this new pattem to resonate with the encoded one. If the match between the 
input vector and fth cluster is found, the network is said to be in resonance. Let 
us review the steps of the leaming algorithm for cluster discovery (Lippmann 
1987; Pao 1989). The cluster encoding algorithm flowchart that we follow below 
is shown in Figure 7.24. 
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Summary of the Adaptive Resonance Theory 1 NetWork Algorithm (ART 1) 


Step 1 : The vigilance threshold p is set, and for the n-tuple input 
vectors and M top-layer neurons the weights are initialized. The 
matrices W, V are (Af X n) and each is initialized with identical 
entries: 


W = 


1 


1 + n J 
V- [1] 

0 < p < 1 


(7.35a) 

(7.35b) 

(7.35c) 


Step 2: Binary unipolar input vector x is presented at input nodes, 
Xj = 0, 1, for i = 1, 2, . . . , n. 

Step 3: All matching scores are computed from (7.31) or (7.33) as 
follows: 

y° m = 1 w im x h for m = 1, 2, . . . , M (7.36) 

i=i 

In this step, selection of the best matching existing cluster, j, is 
performed according to the maximum criterium (7.34) as follows: 


4 >o - 


max (y ° m ) 


(7.37) 


[A practical note is that the recurrences of (7.32) done by 
MAXNET may be skipped in a discrete-time leaming simulation 
since their only consequence of importance is suppressing to zero 
all upper node outputs but the y’th one.] 

Step 4: The similarity test for the winning neuron j is performed as 
follows: 

tAm I v i j x i > P (7.38) 

11*11 i=l 

where p is the vigilance parameter and the norm ||x|| is defined 
for the purpose of this algorithm as follows: 

M „ A 


n 

I 

r=l 


(7.39) 


If the test (7.38) is passed, the algorithm goes to Step 5. 

If the test has failed, the algorithm goes to Step 6 only if the 
top layer has more than a single active node left. Otherwise, the 
algorithm goes to Step 5. 

Step 5: Entries of the weight matrices are updated for index j pass- 
ing the test of Step 4. The updates are only for entries (ij). 
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where i — 1, 2, 


. , AT, and are computed as follows: 
W,j( 0.5 + i; =1 Vij (t) Xi 


(7.40a) 


(7.40b) 


Vij(t + 1) = X^jit) 

This updates the weights of the / th cluster (newly created or 
the existing one). The algorithm retums to Step 2. 

Step 6: The node j is deactivated by setting y- to 0. Thus, this mode 
does not participate in the current cluster search. The algorithm 
goes back to Step 3 and it will attempt to establish a new cluster 
different than j for the pattem under test. 


In Step 4, the vigilance test is performed on the input vector by comparing 
the scalar product of the best matching cluster weights \j with the input vector 
divided by the number of bits equal to 1 of the input vector. If the ratio is less than 
the vigilance threshold, the input is considered to be different from previously 
stored cluster exemplars. If, at the same time, there are no cluster nodes left for 
disabling, a new cluster is added. Addition of a new cluster exemplar requires 
adding one new node and 2 n new connections. 

The vigilance threshold p is typically a fraction indicating how close an 
input must be to a stored cluster prototype to provide a desirable match. A value 
close to 1 indicates that a close match is required; smaller values indicate that 
a poorer match is acceptable. Thus, for lower values of p, the same set of input 
pattems may result in a smaller number of clusters being discovered by the ÅRTI 
network. 

The cluster j is stored in the form of weight vector w y as defined in (7.31). 
These encoded vectors constitute so-called long-term network memory . In com- 
parison, weights v,y used for verification of cluster exemplar proximity can be 
considered to be the short-term memory of the network. In summary, the ÅRTI 
network leaming algorithm described performs an off-line search through the 
encoded cluster exemplars and is trying to find a sufficiently close match. If no 
match is found, a new class is created. 

The recall of stored pattems by this network can be interpreted as identical 
to leaming; however, the vigilance test is always passed and no weight updates 
are performed. A complete description of discrete-time recall in the ÅRTI net- 
work involves complex difference equations, since the system involves numerous 
feedback loops of which only autonomous competitive top-layer recurrences were 
taken into account in this discussion. The name of the system comes from the 
resonance that the system finds in Step 4 when the pattem possibly resonates 
with one of the pattems, say of class j, and reinforces the storage for this class. 
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EXAMPLE 7.5 


This example illustrates stages of discrete-time leaming of an example ÅRTI 
network with four category neurons indicating the potential for four clusters 
and a 25-dimensional input vector with entries 0, 1 (Stork 1989). We assume 
here that the W and V matrices are of size (25 X 4). Although we are not 
sure initially whether or not four top-layer cluster neurons are all needed, 
we consider the choice M — 4 as the only reasonable network initialization. 
If the number of clusters results after training as smaller than the number 
of input vectors, the actual number of cluster neurons needed in the top 
layer would reduce accordingly. Simulations of leaming with two different 
vigilance values are shown in Figure 7.25. The vectors representing the bit 
maps of input pattems are arranged in column vectors in the sequence in 
which they are presented, left to right. 

Consider first the high-vigilance case shown in Figure 7.25(a) of p = 
0.7. Weights are initialized as 

1 

W ‘j ~ 26 

Vij =1, i = 25, j = 1, 2, 3, 4 

Step 1 : When pattem A is presented, one of the four top-layer neurons 
has the largest output. It is arbitrarily denoted as neuron number 1. The 
vigilance test is unconditionally passed since the left side of (7.38) is 
of unity value in the first pass. This results in unconditional definition 
of the first cluster. The result of the weight adjustment computed from 
(7.40a) is that all weights w n connecting inputs having x t = 1 with 
the top-level node 1 are increased to 2/11. Also, top-down weights 
v n adjusted according to (7.40b) remain at unity value as initialized if 
x t = 1; otherwise they are set to 0. 

Figure 7.26 shows the ÅRTI network after completion of the first 
training step. Note that from among 25 X 4 = 100 weights that process 
raw input from the bottom to the top, w,y, only 5 have changed. From 
among 100 top-to-bottom weights, v,y, 5 are set to the unity value and 
19 are zeroed. All the remaining weights remain as initialized and are 
not shown in the figure. 

Summary of first pass of al gorit hm , pattern A is input : 

2 

*1,1 = w 7.1 = W 13,l = *19.1 = W 25A = — 

The remaining weights w n = 1/26 as initialized. 

v i,i = v 7 ,i = v i 3 ,i = v i 9 ,i = v 25 ,i = 1 as initialized. 

The remaining weights are recomputed as v n = 0. 
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Input 

pattems 



m 


M 



High vigilance (p=0.7) 
(a) 






Categories 

1 2 



Low vigilance (p=0.3) 


(b) 


3 


1 


Flgure 7.25 Discrete-time learning simulation of an example ÅRTI network. [from Carpenter 
and Grossberg (1987). © Academic Press; reprinted with permission.] 


Step 2: During the presentation of input pattem B there is no top-layer 
node competing for clustering, since the only active node is 1. The 
vigilance test results in 


1 25 / 1 \ 


Due to the failure of the vigilance test and the absence of other nodes 
for further evaluation and for potential disabling, pattem B is treated as 
a new cluster. The cluster is therefore represented by another neuron, 
arbitrarily numbered 2. Its connecting weights are recomputed as w a 
and v j2 . 

Summary of second pass of algorithm, pattern B is input: 


w l,2 = w 5,2 = w l,2 ~ w 9,2 = W 13.2 = W I7,2 

2 

~ ^ 19, 2 ~ w 21,2 — ^ 25,2 — 
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The layers are fully connected with weights as 
initialized, with the exception of i,= 0, i#l, 7, 13, 19, 25 


Figure 7.26 The ÅRTI network from Example 7.5 after the first training step. 


The remaining weights w a = 1/26. 


v l,2 - v 5,2 — v 7,2 — v 9,2 ~ v 13,2 — v 17,2 — v 19,2 ~ v 21,2 — v 25,2 — 1 


The remaining weights v (2 = 0. 

Step 3: During the presentation of pattem C we have the following match- 
ing scores computed from (7.34): 


= 5 (n) + 8 (é) * 1217 
^ = 9 (å) + 4 (s) = U01 


Thus, neuron 1 results as a winner. However, the vigilance test is not 
passed since 


1 

INI 


25 5 

I v n x i = TT < °- 7 
/=! 13 


Node 1 is then disabled and, consequently, node 2 emerges as a winner 
because of the absence of a competitor node. The vigilance test yields 


1 

H 


25 9 

I v i2*i = < °- 7 

/=! 13 


and it fails again, indicating therefore inadequate similarity between the 
input C and any of the two existing clusters. 
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Thus, top-layer neuron numbered 3 is added as a new category and 
weights w i3 and v i3 are updated accordingly. In contrast to the recent re- 
jection of existing clusters, pattem D is classified to category 3 because 
of the similarity with pattem C. This is due to the faet that the vigilance 
test was passed and to the resonance of pattem D within the existing 
category 3. Notice that the weights computed originally within the third 
pass of the algorithm and connecting to and from top-layer neuron 3 
have been affeeted in the fourth pass. After pattems A, B, C, and D 
have been categorized and recorded, when subsequently presented to 
the network, each of them will access the appropriate category neuron 
of the top layer and resonate with it. Noticeably, raising the vigilance 
level to 0.9 or 0.95 will entirely prevent the network from recoding 
within existing clusters. Therefore, using the training algorithm with 
this value of the vigilance level enables pattem recall. 

The results of ÅRTI network leaming with low vigilance set at 
0.3 are shown in Figure 7.25(b). Just two categories are formed in 
this case. If the vigilance is set at 0.8, however, four categories are 
produced, one for each pattem. Lowering it to 0.2 would cause only 
one diffuse category to be formed based on the pattem set A, B, C, 
and D . All of the four pattems would be categorized as of the same 
category. ■■ 

The ÅRTI network discussed here is a simplified version of the network 
working only with unipolar binary inputs. The network performs very well with 
perfeet input pattems, but even a small amount of noise superimposed on train- 
ing pattems can cause problems (Lippmann 1987). With no noise, the vigilance 
threshold can be set such that two pattems that are somewhat similar but do differ 
are considered to be different by the network. However, for training using noisy 
pattems, they can become categorized under noisy conditions. Under no noise 
conditions, the algorithm shows good performance in both leaming and recall. It 
is reported to be stable and reliable for clustering and classifying unipolar binary 
input vectors. 


EXAMPLE 7.6 

This example demonstrates the storage of pattems under noisy conditions. 
The ÅRTI network is developed using the ANS program from the Appendix. 
Assume that there are four noise-free unipolar binary pattem vectors rep- 
resenting bit maps of the characters A, B, C, and D, which are shown in 
Figure 7.27(a). The bit maps are of size 5X5, like those in the previous 
example. 
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(a) 


Input 

pattem 

number 

Clustering 

Desired 

resonance 

cluster 

Vigilance level 
0.95 0.90 0.85 

1 

1 

1 

1 

1 

2 

2 

2 

2 

2 

3 

3 

3 

3 

3 

4 

4 

4 

4 

4 

5 

5 

5 

1 

1 

6 

6 

6 

5 

1 

7 

7 

7 

2 

2 

8 

8 

7 

6 

2 

9 

7 

7 

3 

3 

10 

3 

3 

3 

3 

11 

9 

8 

4 

4 

12 

8 

7 

4 

4 


Figure 7.27a, b Clustering of patterns in Example 7.6: (a) noise-free prototypes, (b) cluster 
generation by the ÅRTI network. 


Training of the network is performed with pattems 1 through 4 first. As 
a result, four clusters are expected to be built and encoded in the network, 
one cluster for each respective character. As can be seen from Figure 7.27(b), 
for each of the three vigilance levels selected for this experiment equal to 
0.95, 0.9, and 0.85, four clusters are appropriately generated. 

Consider that a distorted version of character A shown in Figure 7. 27 (c) 
is submitted as input to the network as the fifth pattem. From the fifth row 
of the table of Figure 7.27(b) it can be seen that the input fails to cluster it 
with the previously encoded character A unless the vigilance level is low- 
ered to 0.85. When the sixth pattem representing another distorted version 
of character A is input, however, the network fails to cluster it with node 
1 under any of the three vigilance levels used for clustering of the fifth 
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character. It creates therefore another category that is either number 5 or 
6, dependent on the vigilance level previously used for the fifth character ’s 
processing. 

The seventh and eighth input characters represent distorted versions 
of character B associated with the cluster neuron 2. They both define the 
new clusters for the highest vigilance level. For the lowest of the vigilance 
levels, the seventh input character resonates properly with the second node, 
however, another new cluster is built for the eighth input. The reader can 
see the tabulated results of clustering for the remaining pattems numbered 
9 through 12 from the figure. Note that the network is unable to cluster all 
noisy versions of pattems under the appropriate category number, but it can 
do this for some of them. If, however, the vigilance level is lowered below 
0.85, then original pattems 1 through 4 used to develop the network do not 
all form separate categories in the first four training steps. This example 
illustrates what limitations can be encountered by the user when training 
the ÅRTI with noisy versions of the prototype vectors. mm 

The reader has probably noticed that most of the neural networks studied 
thus far have the property that leaming a new pattem, or association, influences 
the recall of existing or previously stored information in the network. When a 
fully trained network must leam a new mapping in a supervised mode, it may 
be even required to releam from the beginning. Similarly, when a new pattem 
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is added to the associative or autoassociative memory, this may affect the recall 
of all previously stored data. 

The ÅRTI network explores a radically different configuration. Note that 
the network works with a finite supply of output units, Af. They are not used, 
however, until needed to indicate a cluster, and can be considered as waiting for 
a potential cluster to be defined. The network, therefore, exhibits a degree of 
plasticity when acquiring new cluster data. At the same time, the recall of the 
data already leamed is not affected, and the weights for clusters already stored are 
not modifiable. Although the formal mathematical description and convergence 
proofs of the model have been omitted in our presentation, the basic concepts of 
leaming and recall provided should enable both the understanding and practical 
use of this important neurocomputing algorithm. 


CONCLUDING REMARKS 

Our focus in this chapter has been on networks that leam in an unsupervised 
mode based on the specified proximity criterion. Except for Hamming network, 
all networks leamed either the similarity between inputs or their input /output 
mapping without supervision. The proximity, or similarity measures, are em- 
bedded in an unsupervised leaming algorithm to enable differentiation between 
similar and dissimilar inputs. In addition, the differentiation capability has been 
gradually developed within the network based on the leaming experience. 

We discussed for the most part self-organization, which uses the scalar prod- 
uct similarity measure for developing clusters. We also covered self-organization 
of pattems into such images in geometrically regular neuron arrays that it re- 
flects the input pattems’ mutual structure. Self-organizing arrays have been able 
to translate clusters in n-dimensional input pattem Space into clustering in low- 
dimensional feature, or image, space. Such self-organizing has preserved and 
enhanced the topological relationships present in the original input space. 

The response of a self-organizing neural network can be used to encode 
input in fewer dimensions while keeping as much of the vital input information 
as possible. We have seen that a drastic dimensionality reduction of input data can 
take place in situations of simultaneous clustering and feature mapping. Indeed, a 
space with a high dimensionality of input data is generally expected to be mostly 
empty, and the nonempty parts of the space can be identified and projected into 
regular array of neurons responding with images that are easy to perceive. 

As we have seen, the competitive leaming rule enables the unsupervised 
training process to work. Winner-take-all leaming and an adjustable neighbor- 
hood definition have been used for weight formation. In addition, extensive lateral 
inhibition equivalent to mixed positive and negative feedback has provided the 
mechanism for selecting the winning neuron in regularly shaped feature arrays. 
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Most networks in this chapter operate through discrimination of the weight 
to pattem similarity. The input /output mapping is typically implemented by 
activating only those neurons whose weights exhibit leamed similarity to the 
present input. Based on this detected similarity, the network output is generated, 
sometimes directly by the receiving layer, but often by the following layer. 

The networks covered in this chapter have found many applications. They 
include minimum-distance vector classification, vector quantization, auto- and 
heteroassociative memory, classification of noisy vectors into clusters, and clus- 
ter synthesis. The class of un§upervised leaming neural networks is particularly 
well suited to applications that require data quantization. Examples where the 
capability of data quantization is useful include statistical analysis, codebook 
communication, and transmission and data compression. The idea of data quan- 
tization is to categorize a given set of random input vectors through a finite 
number of classes. The index of the class is found as a number of the output 
neuron that fires. We then can transmit or store the index of the class rather 
than complete data. This, however, requires a list of codes, or codebook, that 
has been agreed upon. Such a codebook must be available at the receiving end 
of the information transmission system, since the data encoded by the network 
at the transmitter side must now be decoded. 

Finally, self-organizing feature maps can be applied for extracting the hidden 
features of multi-dimensional input pattems and their probability of occurrence. 
The inputs are mapped to the line or plane of output units so that neighborhood 
relations are transferred from the input space to the neuron output space. This is 
especially useful for detection of hidden relationships that are difficult to group, 
organize or categorize in a sensible manner. One of the interesting applications 
of the map is for human language study and is presented in Section 8.6. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P7.1 Design the Hamming network for the three following class prototype vec- 
tors: 

s (1) =[l -1 1 -1 

s (2) =[-l -1 -1 - 

s (3) =[l 11-1- 

(a) Compute the weight matrix. 

(b) Find net m , for m = 1,2, 3, for the input vector x = [111111] 
and verify that the HD parameter computed by the network agrees 
with the actual HD. 


i -i]', 
i -i i]‘, 

i ir 
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(c) Find the normalized responses f(net m ) at the output of each node of 
the Hamming network. 

P7.2 MAXNET with four input nodes has been designed for s = 0.1. After the 
two initial recurrences completed, the following network output resulted: 

y 2 = [0.186 0.307 0.428 0.549]' 

Based on these data: 

(a) Find the initializing vector y° that has been applied initially, prior to 
recurrences. 

(b) Knowing that y° is the output vector of the Hamming network, de- 
termine the HD value between the closest 10-bit prototype and the 
Hamming network input vector originally applied to yield the com- 
puted y°. 

P73 The MAXNET with four output nodes, p = 4, receives the input vector 

y° = [0.5 0.6 0.7 0.8]' 

(a) Find the s value that would be required to suppress the output of the 
weakest node exactly to the zero value after the first cycle. 

(b) Find subsequent responses of the network, y 1 and y 2 , for the computed 
value of s. 

P7.4* Design the Hamming network and the associated MAXNET network to 
provide optimum classification of bipolar binary vectors representing char- 
acters A, /, and O in Problem P4.16 in terms of their minimum Hamming 
distance. 


P7.5 


P7.6 


Assume that the following condition holds for the vector initializing a 
MAXNET network with p nodes: 


o <y°<y° <yj< 1 


Prove that the following increment condition holds during recurrences in 
the linear mode: 


; v * +1 - y k 
wp yp 


< 


,k + 1 


y p -i “Vi 


< 


\y r 
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Given are three cluster centers as 
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-4 


5 


-2 


0 


Compute the weights of the network that has three cluster detecting neu- 
rons. Perform a computation based on the requirement that weights must 
match such pattems. Assume that the normalized weight values of the 
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three neurons have initially been tuned to the normalized values x t , x 2 , 
and x 3 , and then denormalized in inverse proportion to each pattem’s 
distance from the origin. Calculate the responses of each neuron in the 
recall mode for f (net) = i+ CX p(- net ) ~~ * w ith inputs x l9 x 2 , and x 3 . 

P7.7 Perform the first leaming cycle using the following normalized pattem 
set: {x 1 ,x 2 ,x 3 ,x 4 } = { 1^_ 45°, 14_ — 135°, 14_90°, l 2 $_ — 180°} and a - 0.5, 
using the winner-take-all training rule for two cluster neurons. Draw the 
resulting separating hyperplanes. Initial weights are to be assumed Wj = 
[ 1 0] r and w° = [0 -1 ] r . 

P7.8 Solve Problem P7.7 graphically using vector subtraction/ addition, and 
renormalization, if needed. Then reduce a to 0.25 and analyze the impact 
of reduced leaming constant a on the initial phase of training. 

P7.9* Implement the winner-take-all leaming algorithm of the Kohonen layer 
for clustering noisy characters C, /, and T shown in Figure 4.19 into each 
of the respective three classes. Assume that the noisy training characters 
are at distance HD = 1 from the noise-free prototypes. Compute the final 
weight vectors w c , w 7 , and w r . During the recall operation, compute the 
responses f(net c ), f(netj) 9 and f(net T ) of each neuron with continuous 
unipolar acti vation function. Use inputs that are respective bipolar binary 
vectors as specified in Section 4.5. Assume A = 1. 

P7.10 Analyze the following counterpropagation network for which the feed- 
forward part is only operational. The network is shown in Figure P7.10. 
Assume the bipolar binary input and output vectors. Provide results for 
three- and two-dimensional cubes indicating cluster centers, which the 
network is supposed to detect and subsequently map into output vectors. 

P7.ll The feedforward part of a counterpropagation network is shown in Figure 
P7.11(a). It is designed to map two bit maps represented by vectors x x 
and x 2 shown in Figure P7.11(b) into two three-tuple vectors, z l and z 2 . 
Identify vectors z x and z 2 by analyzing the network. Also, complete the 
design of the Kohonen layer. 

P7.12 The objective in this problem is to design a so-called Gray code converter 
of bipolar binary numbers using the feedforward part processing of the 
counterpropagation network. The Gray code converter has the property 
that when x, indices increase as 0, 1, 2, 3, only a single output bit of z, 
changes. 

(a) Design the first layer of the converter as a Kohonen layer, which 
converts four two-bit input vectors 
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f(net ) = 



(training) 



net^O 
net> 0 


(recall) 


Weight= + 1 

Weight= — 1 


Figure P7.10 Feedforward network for analysis in Problem P7.10. 


into, respectively. 



(b) Design the second layer of the converter as a Grossberg layer of 
the feedforward counterpropagation network. Find V such that the 
corresponding converter outputs are 



Use the explicit design equation (instead of training on noisy input 
data) to implement mapping and to obtain W and V. 

P7.13 Design the feedforward part of the counterpropagation network that per- 
forms the operation XOR (x x ,x 29 x 3 ) such that z = x x ®x 2 ®x 3 as shown 
in Figure 3.9(b). Use the explicit design equation instead of training the 
network on noisy data to obtain W and V. Assume x t = ±1 and z { = ±1. 
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(recall) 


(a) 


Weight=-1 



x, = [l-lll]' 



(b) 


Figure P7.11 Feedforward part of the counterpropagation network: (a) diagram and (b) input 
vector bit maps. 


P7.14 Design the feedforward part of the counterpropagation network that asso- 
ciates the binary bipolar vectors representing bit maps of character pairs 
04, C), (1,1), (0,T) as specified in Problem P6.19. Use the explicit design 
formulas to find W and V instead of training the network on noisy data. 

P7.15* Solve Problem P7.14 by using noisy training data for all six characters 
specified. Assume that the noisy vectors are at the distance HD = 1 
from the undistorted prototypes. Train the network in the three following 
modes: 
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Figure P7.16 Feature map for Problem P7.16. 


(a) noisy input vectors only at HD = 1 from input prototypes, outputs 
are undistorted 

(b) noisy output vectors only at HD = 1 from output prototypes, inputs 
are undistorted 

(c) both input and output distorted so that each is at HD = 1 from respec- 
tive prototypes (random bits distorted, no input /output correlation) 

P7.16 The Figure P7.16 shows the final weights of a self-organizing feature 
map that has been trained for the following conditions (Kohonen 1984): 
two-dimensional input vectors have been uniformly distributed over the 
triangular area and the self-organizing array consisted of 15 neurons. Lines 
between points connect topological neighbors only; points indicate weight 
coordinates. Sketch the diagram of the network that, in your opinion, has 
undergone self-organization. Label inputs, neurons, and weights for the 
computational experiment, results of which are reported on the figure. 
Suggest the initial weight values that may have been used. 

P7.17 A set of five four-dimensional vectors is given 
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Implement the self-organizing feature map algorithm to map the given 
vectors onto a 5 X 5 array of neurons. Hexagonal neighborhoods should 
be formed for the array training as shown in Figure P7.17. Decrease a 
linearly from 0.5 to 0.04 for the first 1000 steps, then decrease it for 
steps 1000 to 10 000 from 0.04 to 0. Decrease the neighborhood radius 
from covering the entire array initially (radius of 2 units) to the radius of 
zero value after 1000 steps. Then calibrate the computed map and draw 
the produced minimum spanning tree for inputs x l5 . . . , x 5 . Compare the 
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Figure P7.17 Hexagonal map and inputs for Problem P7.17. 


computed results produced by the network with the predicted minimum 
spanning tree. 

P7.18 Perform three steps of the discrete-time leaming algorithm of the ÅRTI 
network for n = 9 and M = 3 assuming that the training vectors are: 

x, = [l 000 1 000 1]' 

x 2 = [1 1 0 0 1 0 0 1 1]' 

x 3 = [1 0 1 0 1 0 1 0 1]' 

Select an appropriate vigilance threshold p so that all three pattems are 
clustered separately when the sequence x 1? x 2 , and x 3 is presented. Com- 
pute the final weights W, and V of the trained network. 

P7.19 Figure 7.25(a) shows the simulation steps of discrete-time leaming in the 
ÅRTI network for n = 25 and M = 4. The bits of the training vectors 
are 1 and 0 for black and white bit-map pixels, respectively. Find the 
following values of the leaming network: 

(a) the lowest value of p that enables definition of a new cluster when 
pattem B is at the input 

(b) the matching scores y° v y®, and y® of neurons when pattem D is at 
the input 

(c) the lowest value of p that enables definition of a new cluster when 
pattem D is at the input. 

P720 Figure 7.25(b) shows the simulation results of discrete-time leaming in 
the ÅRTI network for n .= 25 and M = 4. The bits of the training vectors 




452 


MATCH1NG AND SELF-ORGANIZING NETWORKS 


are 1 and 0 for black and white pixels, respectively. Find the vigilance 
threshold preventing the occurrence of the second cluster when pattem 
D is at the input. Assume that pattems A, B, and C have already been 
clustered together as shown. 

P7.21* Implement the leaming and recall algorithm of the adaptive resonance 
theory network ÅRTI for the folio wing conditions: 

M = 4,n = 25, and user-specified p. 

Perform the algorithm and program tests by leaming the cluster proto- 
types as shown in Example 7.5. Test the leaming and recall phases for 
the different vigilance levels. 
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8 

APPLICATIONS OF 
NEURAL ALGORITHMS 
AND SYSTEMS 


Ni hi l simul inventum est et per - 
fectum. (Nothing is invented and 
perfected at the same time.) 

LATIN PROVERB 

8.1 Linear Programming Modeling Network 

8.2 Character Recognition Networks 

8.3 Neural Networks Control Applications 

8.4 Networks for Robot Kinematics 

8.5 Connectionist Expert Systems for Medical Diagnosis 

8.6 Self-Organizing Semantic Maps 

8.7 Concluding Remarks 

O ur study thus far has focused on the theoretical framework and math- 

ematical foundations of artificial neural systems. Having understood the 
computational principles of massively parallel interconnected simple “neural” 
processors, we may now put them to good use in the design of practical systems. 
Numerous clues and guidelines for applications of neural algorithms and systems 
were developed in preceding chapters. However, most such applications have 
been presented to illustrate the operating principles of neural networks rather 
than to reflect a practical focus. As we shall see in this chapter, neurocomputing 
architectures are successfully applicable to many real-life problems. Moreover, 
their application potential seems to be enormous but not yet fully realized. 

Examples of diverse applications are tied together in this chapter. The pub- 
lished technical references on applications of neural networks number well over 
a thousand papers and perhaps several hundred more are printed every year. 
Instead of merely enumerating many successful project titles of that list in an 
encyclopaedic manner, the approach taken in this chapter is to study about a 
dozen applications in detail. The applications selected for coverage are believed 
to be representative of the field, and the depth of their jp overage should prepare 
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the readers to pursue their own application project development. However, since 
the area of neural networks is expanding very rapidly, new problems and solution 
approaches may emerge in the near future beyond those presented in this chapter. 

First \ye discuss a single-layer fully coupled feedback network for solving 
classical optimization problems. The network used for this purpose is a gradient- 
type network and it can be designed so that it minimizes the linear cost function 
in the presence of linear constraints. Character recognition networks are discussed 
for both printed and written character evaluation. We shall see that although ordi- 
nary multilayer feedforward networks consisting of continuous perceptrons can be 
successfully used for this purpose, unconventional architectures and approaches 
result in even higher character recognition rates. 

Selected approaches to building neurocontrollers are presented with expla- 
nations of how they make it possible to identify the plants, and how they can be 
used as controllers. Adaptive control issues as well as neurocontrol of the dy- 
namic plant are examined and illustrated with application examples. Subsequently, 
techniques closely related to neurocontrol systems and applied to the control of 
robot kinematics are covered. Feedforward networks capable of performing kine- 
matics transformations, trajectory control, and robot hånd target point leaming 
are examined. 

Next, three connectionist expert systems are presented. The real-life exam- 
ples of medical expert systems for diagnosis of skin diseases, low back pain, and 
heart diseases are examined. Their performance is reviewed and compared with 
the success rate of human experts. Finally, the applications for self-organizing 
semantic feature maps are covered. Their practical use can be envisaged for sen- 
tence processing and extraction of symbolic meaning and contextual information 
implicit in human language. 

In this chapter an attempt is made to review a number of successful imple- 
mentations of massively parallel networks, and to apply them to different tasks. 
We will try to determine the best architectures to accomplish certain functions, 
how the networks should be trained, whether or not they can be useful during 
the training phase, how well they perform in real-life training examples, and 
how they are able to generalize to unknown tasks from the already known ones. 
Most of these questions cannot be answered independently and some of them 
cannot be answered conclusively. By the end of the chapter, however, we will 
have an understanding of how to select and design networks to fulfill a number 
of particular goals. 


LINEAR PROGRAMMING 
MODEUNG NETWORK 

The interconnected networks of analog processors can be used for the so- 
lution of constrained optimization problems, including the linear programming 
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problem. As studied in Chapter 5, properly designed single-layer feedback net- 
works are capable of acquiring the properties of gradient-type systems. Such 
:;y stems can be used to solve optimization problems since they minimize, in 
time, their energy function value. The crucial task in designing a system seeking 
the minimum of its energy function is the translation of the optimization problem 
into a suitable definition of the energy function. The gradient-type network of 
analog processors that can find the solution of the linear programming problem 
is discussed in this section. 

The linear programming problem may be stated as follows: Find the un- 
knowns v 1? v 2 , . . . , v n , so as to minimize the cost function 

c = a x v j + a 2 v 2 + • • • + a n v n (8.1a) 

where a l9 a 2 , . . . , a n are called price coefficients. The minimization is subject to 
the following conditions, which are linear constraints: 

w H v, + w l 2 v 2 + ■ ■ • + w u v„ 2: bl 

w 2l v l + w 22 v 2 + • • • + W 2 n V„ > b 2 

(8.1b) 


w ml v \ + w m2^2 + ’ ' ’ + W mn V n > b m 


where coefficients bj , for j = 1, 2, . . . , m, are called bounds or requirements. 

The cost function (8.1a) may be equivalently expressed in matrix form using 
the price vector a and the variable vector v 

c = a'v (8.2a) 

where 

a=[«i « 2 ••• a„Y 

Ar 1 1 

v=[v, v 2 ... V„J 
The linear constraint condition (8.1b) now becomes 

wjv s bj, for j = 1, 2, . . . , m (8.2b) 


where 



w j2 



t 


contains the n variable weight coefficients w y7 of each y’th linear constraint in- 
equality (8.1b). 

A gradient-type network based on the energy minimization concept can 
be used to compute the solution of problem (8.1) or (8.2). A variation of the 
mathematical analysis used in Chapter 5 leads to the network shown in Figure 
8.1(a) for the case n — 2 and m = 4. The description below follows the approach 
and the network design example outlined in Tank and Hopfield (1986). 
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(Cuirents) 



Figure 8.1a The architecture of a single-layer feedback network that solves a two-variable four- 
constraint linear programming problem: (a) network. [©/£££; reprinted from Tank and Hopfield 
(1986) with permission.] 


The network modeling the outlined minimization task consists of n neurons 
with output voltages v,-, for i — 1, 2, .... n, which are variables in the linear 
programming problem. The input-output relationship of neurons is denoted as 
v, = g(«j). Since the neurons are described here by a linear activation function, 
only n simple linear amplifiers of constant gain need to be used. The remaining 
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(b) 



Figure 8.1b, c The architecture of a single-layer feedback network that solves a two-variable 
four-constraint linear programming problem (continued): (b) neuron and constraint amplifier char- 
acteristics, and (c) penalty function. [©IEEE; reprinted from Tank and Hopfield (1986) with 
permission.] 


m amplifiers in the network are called constraint amplifiers and are characterized 
by the nonlinear relationship fy = f(uj): 


where 


A f 0, for « > 0 

f( u j) | — Uj, for Uj < 0 

(8.3a) 

II 

1 

s 5 

II 

N> 

2 

(8.3b) 


The voltage transfer characteristics of the neuron and of the constraint amplifier 
are shown in Figure 8.1(b). The nonlinear function definition (8.3a) provides 
for the output of the constraint amplifier to yield a large positive output value 
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when the corresponding constraint inequality it represents is violated, i.e., when 
Uj is negative. Each of the constraint amplifiers receives constraint input current 
proportional to the bound value —bj, and its additional input is expressed as the 
linear combination of neurons’ outputs in the form X/ w j, v ,- 

The output voltage fy of the y’th constraint amplifier produces the input 
currents to all neurons that are linear amplifiers. The input current of the fth 
neuron is proportional to -w ;i ,as well as to the price coefficient -a t . As with the 
neurons of the Hopfield model network of Figure 5.4, the g(u t ) amplifiers have 
input conductance and capacitance denoted as g, and C,, respectively. Although 
these components are not explicitly designed parts of the linear programming 
modeling network, they both represent the amplifiers’ parasitic input impedance 
and are responsible for the appropriate time-domain behavior of the entire net- 
work. At the same time, we assume here that the response time of the t ffitj) 
constraint amplifiers is negligibly small compared to that of the g(« ; ) amplifiers. 

Under the assumptions stated, the KCL equation for the neuron input nodes 
of the network shown in Figure 8.1(a) can be expressed as it was in (5.11). It 
takes the form of the following State equation: 

d«, m 

C,— = ~a i - ufij - X w u fy, for i = 1, 2 ,...,« (8.4a) 

d t j= i 


where G, denotes the sum of all conductances connected to the input of the /’th 
neuron and is equal to 


A OT 

G t = gi - X wji 

j= i 

Using (83), the above formula can be expressed as follows: 
du- m 

C,-~ = —a, — ufi, - X Wjif( w'.v - fy), for i = 1, 2, . . . , n 
at j = i 1 


(8.4b) 


(8.5) 


The discussed network for solving the linear programming task needs to 
minimize a meaningfully selected energy function. Assume now that the follow- 
ing Liapunov’s function is an adequate choice for the energy function in this 
problem: 

A m / \ n f v i i 

E(v) = a'v + X P ( w'v — bj) + X Gj I g 1 (v)dv (8.6a) 

j=i v J ' ;=i Jo 


where F(z ) is defined for dummy variable z as follows: 


m = 


d F(z) 
dz 


(8.6b) 


The components of the gradient vector of the assumed energy function (8.6a) 
can be expressed by finding its derivatives as follows 


dE(\) 


m 
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The time derivative of the energy function of (5.13) can now be expressed using 
(8.7): 


d£ _ « dv,- 
dr " ,= i d T 


cij + UjGf + 


m 

I }v 

7=1 ; 



( 8 . 8 ) 


Since the bracketed expression in (8.8) is the right side of the network State 
equation (8.4a), the time derivative of the energy function can be rewritten in 
the form similar to (5.17): 


_ _ « dv^ dn, 
dr ,=i ' dr dr 

Using (5.18), the above expression can be rearranged as follows: 



(8.9a) 

(8.9b) 


Since C, > 0, and g -1 (v,) is the monotone increasing function, the sum on the 
right side of (8.9) is nonnegative, and therefore we have d£/dr < 0, unless 
dv, /dr = 0, in which case dE/dt = 0. Thus, the network shown in Figure 8.1 
and described with the State equations (8.4) is a gradient-type network. During 
the transient period, the network seeks the minima of its energy function (8.6a) 
and the motion of the system stops there. It will seek the minima in Space v 1? v 2 , 
. . . , v„ in the direction which has the positive projection on the negative gradient 
vector, -V£(v), as explained in more detail in Section 5.3. 

In faet, the cost term a'v is only minimized during the transient State, while 
the second term of (8.6a) is needed to prevent the violation of the constraint 
relationship (8.2b) and to push the network equilibrium point toward the edge 
of the allowed region. This can be seen from Figure 8.1 (c) which shows the 
nonzero penalty function £(z) for inequality constraints that are not fulfilled. 
There is no penalty for constraints met when z > 0. However, a steep penalty 
applies if £(z) = (1 /2)z 2 for z < 0. This results from integrating the function 
f(u) introduced in (8.3a) and using the definition of F(z) from (8.6b). 

A small computational network has been developed using an electronic 
model as shown in Figure 8.1(a) (Tank and Hopfield, 1986) to solve a two- 
variable linear programming problem with four constraints. The constraint am- 
plifiers with voltage transfer funetions (8.3a) are operational amplifiers with 
diode circuitry to provide both damping response (ift = 0), and linear response 
(i jf= w) circuits. The following constraint inequalities representing inequalities 
(8.1b) chosen arbitrarily for the experiment are 


v 2 < 5 
-v, < 5 


5 35 

12 Vl V2 “ 12 
5 35 

-V, + V 2 < — 


(8.10a) 
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Figure 8.2 Two-variable linear programming example: (a) constraints and solution region, 

(b) cost plane c = ^ + v 2 , (c) cost plane c = ir, - v 2 , and (d) trajectory when gradient vectors 
are switched. [Parts (a) and (d) ©IEEE; adapted from Tank and Hopfield (1986) with permission.] 


By comparing coefficients of inequalities (8.10a) with those of (8.1b), weight 
coefficients (conductances Wy) and input biases (currents bj) can be computed. 
Figure 8.2(a) shows constraint lines (8.10a) depicted with dashed lines on the 
plane v,, v 2 , which is the solution domain. The region of penalty-free solutions 
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is within the shaded quadrangle and includes its edges and vertices. Indeed, the 
solution point minimizing the cost function (8.1a) is one of the four vertices of 
the quadrangle shown. 

The choice of the solution point by the system modeling the problem is 
dependent on the orientation of the cost function. Let us look at the geometrical 
interpretation of the cost minimization issue. The cost function itself can be 
depicted as a plane in the three-dimensional space v 1 ,v 2 ,c, for n — 2. Moreover, 
the cost plane described by Equation (8.1a) always intersects the origin. The 
direction of the negative gradient vector of the linear t\yo-variable cost function 
c can be represented as a vector from the origin pointing in one of the directions 
as shown in Figure 8.2(a). The negative gradient vector of the cost function c is 
given as 


-Vc(v) 


-a x 

~ a 2. 


(8.10b) 


As the cost function is changed, the cost plane tilts in a new direction, and 
the optimum solution of the linear programming problem may also change by 
displacement from one point to another. In such a case, the solution would dis- 
locate from one vertex to another. Figures 8.2(b) and (c), show two sample cost 
planes defined by the cost function (8.1a) with price coefficients a x — a 2 = 1, 
and a x = 1, a 2 = —1, respectively. The corresponding vector s indicating the 
direction of the steepest descent of cost functions can be expressed as 


and 




(8.10c) 

(8.1 Od) 


The negative gradient vectors, shown for each of the planes in Figures 8.2(b) and 
(c) as Corning out of the origin, indicate the directions of the transient solution 
of the problem for known constraints. As can be seen from Figure 8.2(a), the 
first vector would guarantee convergence of the network to the vertex (-5,-5), 
the second would result in the solution at the vertex (—5,5). 

For the series of experiments reported in Tank and Hopfield (1986), the 
output voltages v x and v 2 of the modeling network have consistently stabilized 
in one of the four vertices shown, dependent on the price coefficients chosen. The 
cost plane gradient vector was swept in a circle to check the convergence to the 
optimal solution in each case. Each time, the Circuit stabilized at the minimum 
cost points corresponding to the correct constrained solution for a chosen gradi- 
ent direction. In another set of experiments reported (Tank and Hopfield, 1986), 
the gradient vectors rapidly switched direction after the network had found the 
solution. This is illustrated in Figure 8.2(d). The stable solution of the problem 
found by the network moved accordingly from the vertex (—5, -5) to the vertex 
(5,5) along the trajectory as shown. Since the solution area is always convex 
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for linear programming problems (Simmons, 1972), the network is guaranteed to 
converge to the optimum solution. 


8.2 

CHARACTER 

RECOGNITION NETWORKS 

As already discussed in Chapter 4, multilayer feedforward networks can 
be trained to solve a variety of data classification and recognition problems. 
Some of the examples of introductory classification applications were presented 
in Section 4.6. This section covers in more detail character recognition networks 
of both printed and handwritten characters. Initial emphasis is placed on the 
comparison of different leaming techniques for feedforward architectures. We 
also demonstrate that handwritten character recognition tasks can be successfully 
performed by unusual and unconventional neural network designs. Specifically, 
successfully tested digit recognizers have employed hand-crafted multiple neural 
template Windows, or transformations, and have involved specialized multilayer 
feature detectors with a limited number of degrees of freedom. 


Multilayer Feedforward Network for 
Printed Character Classification 

This section provides a comprehensive performance evaluation of the limi- 
tations and capabilities of multilayer feedforward networks for printed character 
recognition. Network leaming parameters, architectures, and leaming profiles are 
investigated (Zigoris 1989). Based on the comparison of different network ar- 
chitectures, the best size and training parameters will be devised for printed 
character classifiers (Zurada et al. 1991). 

An introductory study of a multilayer continuous perceptron network for 
simple printed character recognition has been reported by Chandra and Sudhakar 
(1988). The performance of various classifiers was discussed and compared for 
recognition of digits “0” through “9” encoded as 20-tuple binary-valued input 
vectors representing black and white pixels. An error back-propagation network 
for classification of these data was evaluated. The network is of a 20-10-10 size 
denoting the number of input nodes, hidden-layer neurons, and output neurons, 
respectively. This convention for denoting network size using symbols I-J-K is 
consistent with Figure 4.7 an with the notation in earlier chapters. At least 1000 
training cycles were used to develop the network, and reasonable classification 
results were achieved and reported by Chandra and Sudhakar (1988). 

A similar study was done on pixel-image classification of amoebae nu- 
clei images (Dayhoff and Dayhoff 1988). This application, which is similar in 
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Figure 8.3 Bit maps for sample characters “1,” “B,” and “a,” for recognition. 


principle to printed character recognition, involved a gray-scale pixel matrix with 
geometric pattems that need to be recognized. The 7X9 pixel-grid signal was 
used and fed through 16 hidden-layer neurons to a 7 X 9 output grid. The study 
was based oh four training images; 3000 cycles were required to train the network 
successfully. 

Let us review the performance of multilayer feedforward architectures for 
classification of a full set of printed lowercase and uppercase letters, digits, and 
punctuation characters. In this application, standard dot-matrix printer characters 
need to be recognized robustly and correctly from a set of 95 characters. Each 
character is a 7-column X 10-row pixel matrix defined by Apple Computer 
(1986). The example bit maps for three characters from the set are shown in 
Figure 8.3. The training set consisting of 95 characters has also been used to test 
the performance of the trained network, and it has provided a relatively large 
number of input /output pattem pairs. Depending on the requirements for the 
classifying network, the output pattem has been chosen either as an 8-bit ASCII 
code of the character (distributed representation) or a single output bit that has 
been set to 1 for each character pattem to be recognized (local representation). 
The total of either 8 or 95 output neurons has been required, respectively, for 
the output layer trained in each of the output representations mentioned. 

Continuous unipolar neurons are employed throughout this experiment for 
training and recall. The binary input vector representing the pixel pattem of each 
character has been fed directly into the network. No effort has been made to 
perform either preprocessing or feature extraction of any kind in this experiment. 
The outputs have been thresholded at 0.5 to obtain a binary output value in the 
test mode. The three basic network properties investigated for this application 
were the leaming constant 17, the activation function steepness coefficient Å, and 
the network architecture. The topologies of the network under study have been 
limited out of necessity since the number of different architectures potentially 
relevant for this classification task is enormous. 
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The number of input nodes / has been kept at a constant value of 70, equal 
to the number of pixels. Single and double hidden-layer networks with different 
layer sizes denoted as J have been evaluated for both 8 and 95 output neurons. 
The number of output neurons is denoted as K. Ideally, the network should be 
able to generate 8-bit ASCII encoded output. Such an 8-bit output architecture 
using eight output neurons is of particular importance because it combines two 
steps. Both recognition and encoding of the character pattem are performed by 
the network in a single processing step. In addition, this architecture yields more 
concise classification results compared to the local representation which requires 
95 output nodes. 

The simulation experiments described later in this section indicate that the 
number of neurons in each of the hidden layers can drastically affect both the 
accuracy and the speed of training. Some pattems require decision regions that 
cannot be realized by a low number of hidden layer neurons. Too many neurons 
in the hidden layer, however, can prevent effective training from occuning since 
the initial decision regions formed become too complex due to the excessive 
number of neurons for the application. The end result is often that the network 
weights get readily trapped in a local minimum during training and no further 
leaming occurs. 

The performance of the network has been measured in terms of its response 
errors with respect to correct classification results. Four different error measures 
have been monitored during the training and recall phases. They are £ max , £ ms , 
N e „, and E d . 

The maximum error denoted as £ max expresses the largest of the single 
neuron errors encountered during the full training cycle consisting of P = 95 
steps, where P denotes the size of the training set. A value of 1 implies that 
at least 1 bit is totally incorrect. Since the character recognition application 
requires binary output, the network implements a postprocessing thresholding 
function to yield binary output decisions. The error E ^ is defined as in (4.35) 
and it serves as an ordinary root-mean-square normalized quadratic error mea- 
sure, before thresholding. The absolute number of bit errors N en at thresholded 
outputs during the single complete training cycle is also used as a local measure 
of comparison of leaming quality and speed. The decision error E d defined in 
(4.36) is used as a normalized error value computed over the total number of 
output bit errors N crr resulting at K thresholded outputs over the complete training 
cycle consisting of presentation of P pattems. 

The simulated error back-propagation training runs were performed in two 
phases (Zurada et al. 1991). The first phase consisted of a large number of 
preliminary runs with various combinations of parameters and architectures. The 
runs were performed to determine an overall training effectiveness for various and 
often intentionally diversified leaming conditions. This can be termed as a coarse 
search for suitable architectures. During the second phase of the runs, optimal 
parameters for network leaming were explored. This second phase emphasized 
a fine heuristic search for optimum training conditions. 
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TABLE 8.1 

Evaluation of 70-J-95 architecture, single 
hidden layer, J = 35 or 70 (printed character 
recognition network). 


Run 

Steepness 

Factor 

A 

Learning 

Constant 

V 

Hidden 

Layer 

J 

Training 

Length 

(cycles) 

Max 

Error 

^max 

^rms 

Bit 

Errors 

^err 

Decision 

Error 

Ed 

1 

1 

0.1 

Kl 


0.9991 

10~ 6 

1 

0.0001 

2 


0.5 

H 

575 

0.0209 

10“ 6 

0 

0 

3 

1 

0.1 

35 


0.1279 

10~ 6 

0 

0 





3975 

0.0033 

10 -6 

0 

0 

4 


0.5 

35 


0.0627 

1(T 6 

0 

0 





3975 

0.0004 

0 

0 

0 

5 

1 

0.1 

35 


1 

2 • 10“ 6 

4 

0.0004 

6 


0.5 

35 

3975 


2 • 1(T 6 

2 

0.0004 

7 

2 

0.1 



1 

4^ 

O 

1 

< 3 \ 

12 

0.0013 





3975 


4 • 10“ 6 

12 

0.0013 

8 


0.5 




2 • 10 -6 

5 

0.0006 





3975 


2- 10 -6 

5 

0.0006 

9 


0.1 

35 

3975 


8 • 10“ 6 

51 

0.0057 

■M; ? 

5 

0.1 



1 

9 • 10“ 6 

68 

0.0075 





3975 


9 • 10~ 6 

68 

0.0075 

11 


5 




1.9 • 10 -5 

281 

0.0311 





3975 


1.9 • 10 -5 

280 

0.0310 


Table 8.1 shows results of the initial series of runs for a single hidden layer 
network of size 70-/-95 with / = 35 or 70. The training length for each run 
was different and involved up to several thousand cycles as listed in the fifth 
column of the table. The most important result of the preliminary training runs 
has been that for A = 5 the network did not train properly (runs 9-11). The 
lowest number of single bit errors was 51 for 35 hidden units when r\ - 0.1. 
The number of errors got worse as the number of hidden units was increased. 
In practical simulation the error tends to oscillate. Since the activation function 
around zero has been very steep for this particular series of runs, this caused 
overcorrections and the weights did not settie at their proper value but instead 
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swung from one extreme to another. Further increases of 17 tend to aggravate the 
problem by further overcorrections and have led to increased error values. 

The problem of excessive errors is reduced with A = 2, which ensures 
slower leaming, but at least several bit errors have consistently occurred (runs 
5-8). Finally, runs with A = 1 seem to yield the best results (runs 1-4). Very 
small error values have been obtained for both 35 and 70 hidden units and for 
leaming constants having a value of either 0.1 or 0.5. The training for 0 bit 
errors required fewer cycles with 70 hidden units (run 2); however, a 70-35-95 
network with A = 1 initially tumed out to be the best for the task evaluated 
(runs 3 and 4). Simulations for 140 hidden units resulted in bit errors that were 
unacceptably high (637 and 1635), oscillatory, and involved prolonged training 
sessions. 

Figure 8.4 illustrates the decision error of the recall phase versus the training 
length for the 70-35-95 network. It has been observed from a series of simu- 
lations that most training runs for the same network configuration and leaming 
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Figure 8.5 Learning profiles for a network with 70 hidden neurons; local representation. 


parameters yield only a slightly different learning profile. This is due to the faet 
that weights are initialized at random values. Thus, the learning profiles shown in 
the figure can be considered typical, or average, for the selected network size and 
training parameters. The best learning profile and the least final error is achieved 
for A = 1 and rj = 0.5 (run 2). The decision error E d typically reduces to zero 
after several hundred training cycles. The error curves on Figure 8.4 for Å = 1 
and 2 also show reasonable training speed and accuracy. 

In contrast, the decision error E d for A = 5, rj = 0.1 does not converge 
to a reasonable value. Somewhat similar learning profiles have resulted for 70 
hidden-unit networks. Training results for this network configuration are depicted 
in Figure 8.5. Figure 8.6 shows the results of a 140 hidden-neuron network 
training process. Notice that the error axis now starts at E d = 0.04. The decision 
error does not undergo substantial reduction in spite of continued training. This 
configuration was therefore found to be unacceptable and was not pursued beyond 
the preliminary phase of training runs. 

Table 8.2 shows the simulation results for double hidden-layer networks 
with the output layer of eight neurons. For this configuration the characters need 
to be encoded by the classifier in the ASCII format. Various architectures that 
have been tested employ the first hidden layer of size J\ = 35 and 70, and 
the second hidden layer of size J2 = 8 and 70. The results for double hidden 
layers are, in general, better than for a single hidden layer, and the final errors 
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Figure 8.6 Learning profiles for a network with 140 hidden neurons; local representation. 


are smaller. Three specific architectural hidden layer sizes (71,/2) that have been 
simulated are: (70, 70), (70, 8), and (35,8) (Zurada et al. 1991). 

Simulations 1 through 6 correspond to A = 1. Virtually all runs show rea- 
sonable results and a small number of bit errors. Runs 2 and 4 have the most 
bit errors, whereas all remaining runs converge to the single bit error. These 
particular runs have = 0.5 in common. In addition to one bit error, runs 3, 5, 
and 6 also display the fastest convergence. Runs 7 through 12 are performed for 
A = 2. These results are more varied and consist of some excellent results along 
with some extremely bad ones. The notably bad results are runs 10 and 12. Not 
only has training been excessively long, but run 10 resulted in an oscillatory 
bit error with a peak value of 103, and run 12 resulted in a large error of 150 
incorrect decision bits. Runs 7 and 8 exhibited superb results since they both 
converged to zero bit errors. In addition, convergence of training to zero bit 
errors was extremely quick, 90 cycles for run 7, and only 40 cycles for run 8. 
Both these runs had J 1 = 70 and J2 = 70 in common. Given this large number 
of hidden-layer units, the redundancy in feature recognition for this network is 
rather large. This seems to be the most likely reason for this particular network 
architecture to have produced no bit errors. 
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TABLE 8.2 

Evaluation of 70-J1-J2-8 architecture, double 
hidden layer (printed character recognition network). 


Run 

Steepness 

Factor 

A 

Learning 

Constant 

V 



Max 

Error 

^max 

^rms 

Bit 

Errors 

Nerr 

Decision 

Error 

Ed 

1 

1 

0.1 

70 

70 

2190 

1 

58 • 1(T 6 

1 

0.0013 

2 


0.5 

70 

70 

310 


83 • 10 -6 

3 

0.0039 

3 

1 

0.1 

70 

8 

390 

1 

98 • 10~ 6 

1 

0.0013 

4 


0.5 

70 

8 

980 


137 • 10 -6 

8 

0.0105 

5 

1 

0.1 

35 

8 

620 

1 

80 • 10~ 6 

1 

0.0013 

6 


0.5 

35 

8 

870 


87 • 10“ 6 

1 

0.0039 

7 


0.1 

70 

70 

90 

0.2034 

47 • 1(T 6 

0 

0 


2 




2520 

0.0001 

3.6 • 1(T 6 

0 

0 






4990 

0.0001 

3.6 • 10 -6 

0 

0 

8 


0.5 

70 

70 

40 

0.9994 

58 • 10 -6 

0 

0 






210 

0.0041 

11 • 10 -6 

0 

0 






830 

0.0001 

3.6 • 10~ 6 

0 

0 






3300 

0.0000 

0 

0 

0 

9 

2 

0.1 

70 

8 

2160 

1 

51 • 1CT 6 

1 

0.0013 

10 


0.5 

70 

8 

4990 

1 

435 • 10~ 6 

103 

0.1355 

11 

2 

0.1 

35 

8 

110 

1 

98 • 10~ 6 

1 

0.0026 

12 


0.5 

35 

8 

4990 

1 

551 • 10 -6 

150 

0.1973 


Table 8.3 summarizes the results of two runs with J\ = J2 = 70, K = 8, 
A = 2, i) = 0.1, and tj = 0.5. Since this configuration with this choice of 
leaming parameters has been so successful, additional simulations have been 
performed for this case to ensure consistency and to allow possibly for further 
optimization of the training. Both runs in Table 8.3 show excellent results, with 
run 1 converging to zero decision errors after 190 cycles, and run 2 converging 
to zero decision errors after only 60 cycles. 

The discussion presented thus far indicates that the number of hidden layers 
and their sizes are of crucial importance for a particular application. Apparently 
a range of values exists for the number of hidden layers and their size that yields 
especially good results. Figure 8.7(a) shows the final decision error E d for A = 1 
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TABLE 8.3 

Evaluation of 70-70-70-8 architecture (printed 
character recognition network). 


Run 

Steepness 

Factor 

A 

Learning 

Constant 

V 

Hidden 
Layers 
J1 J2 

Training 

Length 

(cycles) 

Max 

Error 

^max 

£rms 

Bit 

Errors 

^err 

Decision 

Error 

Ed 

1 

2 

0.1 

70 70 

1 

0.999 

798- 

10 -6 

374 

0.492 





10 

0.999 

529* 

10~ 6 

160 

0.210 





20 

0.999 

428* 

1(T 6 

106 

0.139 





30 

0.998 

319* 

1(T 6 

50 

0.066 





40 

0.997 

214* 

10 -6 

23 

0.030 





50 

0.997 

120* 

10 -6 

4 

0.005 





60 

0.996 

98* 

10~ 6 

3 

0.004 





70 

0.996 

91 * 

1CT 6 

3 

0.004 





80 

0.996 

76* 

1(T 6 

2 

0.003 





90 

0.993 

76* 

1(T 6 

2 

0.003 





190 

0.042 

29* 

10 -6 

0 

0 





200 

0.020 

22* 

1(T 6 

0 

0 





1500 

0.0003 

3.6* 

10 -6 

0 

0 





2000 

0.0002 

3.6* 

10~ 6 

0 

0 

2 

2 

0.5 

70 70 

1 

0.997 

789* 

1(T 6 

329 

0.433 





10 

0.998 

660* 

10 -6 

244 

0.321 





20 

0.999 

428* 

1(T 6 

105 

0.138 





30 

0.996 

192* 

lO -6 

18 

0.024 





40 

0.910 

54* 

lO -6 

1 

0.001 





50 

0.756 

51 • 

lO -6 

1 

0.001 





60 

0.005 

18* 

1(T 6 

0 

0 





70 

0.003 

14* 

10~ 6 

0 

0 





420 

0.002 

3.6* 

10 -6 

0 

0 




J 

1990 

0.000 

3.6* 

10~ 6 

0 

0 


and 7} = 0.1 versus the number of hidden units of the single hidden layer. The 
error has been computed after 500 training cycles for the number of hidden layer 
neurons covering the range from 5 to 150. The number of output units is kept 
equal to 8 thus ensuring an ASCII format for the output. The results indicate that 
low error values have been achieved within the range from 20 to 90 hidden units. 
The network with less than 20 hidden neurons has been incapable of leaming 
all pattems, presumably, because the weight space has been too complicated. 
The error level has also become intolerable for more than 90 hidden units. The 








system seems to be seriously overdetermined because of the large number of 
hidden units. 

The training of a network with two hidden layers yielded superior results. 
Figure 8.7(b) shows the final decision error E d versus the number of hidden units. 
In this training simulation, the number of hidden units is assumed to be variable 
but identical for both layers. The resulting optimum architecture is somewhat 
similar to that of the single hidden-layer network, ranging from 20 to 85 units. 
The primary difference was that the two hidden-layer network version tends to 
leam more reliably and quickly. 

Figure 8.8 shows selected leaming profiles for the training of single-layer 
networks. Several hidden-layer sizes covering the range from 20 to 80 units have 
been tested to find an optimum architecture. The number of training cycles needed 
to achieve a fixed decision error value seems to remain largely invariant with 
respect to the number of hidden-layer units. Although it is impossible to express 
analytically the error profiles as a function of training length, the simulations 
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Figure 8.7b Decision error versus the number of hidden units, A = 1 and tj = 0.1 (contin- 
ued): (b) double hidden-layer network of identical sizes (J1 = J2). 


seem to indicate that the decision error is approximately inversely proportional 
to the number of training cycles. With continued training, the error somewhat 
saturates at a low level. These conclusions seem to hold only for the case of 
feasible architectures, i.e., architectures that can leam the problem under study. 

For this nontrivial application of multilayer feedforward networks for char- 
acter classification, a number of reasonable solutions emerge. Let us summarize 
the results of the classification case study discussed in this section. The compu- 
tationally most efficient value of activation steepness, A , seems to be of unity 
value throughout each leaming session. The leaming constant of 0.5 works best 
with the number of hidden units ranging from 25 to 80. This size of hidden layer 
seems to be the most advantageous for both one and two hidden-layer systems. 
The best classification results are obtained with a two hidden-layer network em- 
ploying about 70 hidden units per layer. Leaming within this architecture has 
been quick and reliable; this size of layers presumably provides excellent re- 
dundancy. Remarkably, it has also been found that this size could be reduced 
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to approximately 25 without any major deterioration of classification. Training 
of networks with both too many or too few hidden units has resulted in slow 
leaming with a somewhat excessive number of decision errors. 

Finally, the configuration with 8 output neurons tums out to be generally 
superior compared to the local representation of classification involving 95 output 
neurons. Although both these architectures ensure a zero final decision error and 
thus both achieve perfect character classification, the configuration with 8 output 
neurons trains faster. It also results in a smaller network. 


Handwritten Digit Recognition: 

Problem Statement 

The ultimate success of neural networks depends on their effectiveness in 
solving a variety of real-life classification, recognition, or association problems 
that are more demanding and more difficult than any of the problems studied 
thus far. Let us look at recognition of objects that are drastically less structured 
than standard printed characters. In this group of problems, character recognition 
of handwritten digits represents an important example of a realistic, yet difficult, 
benchmark recognition task. This task has a clearly defined commercial impor- 
tance and a level of difficulty that makes it challenging but still not so large 
and complex so as to be completely intractable. Recent technical reports have 
demonstrated that neural networks can perform handwritten digit recognition with 
state-of-the-art accuracy (Jackel et al. 1988; LeCun et al. 1989; Howard et al. 
1990). In a number of respects, neural network solutions surpass even those 
obtained using other approaches. The basic neurocomputing techniques used for 
handwritten character classification are reviewed below. 

Analysis shows that many difficult pattem recognition problems can be for- 
mulated as multidimensional curve fitting using the methods of approximation 
leaming described in Section 2.4. In character recognition, however, the general 
mies for distinguishing between characters are neither known, nor have they 
been formulated. The best current process is to examine a large cross section 
of the character population with the goal of tinding a function that adequately 
generalizes the exemplars from the training set. This would possibly allow for 
recognition of the test characters not contained in the original training set. Let us 
look at the neural network leaming algorithms and approaches that first prepro- 
cess and then map handwritten digit character images to one of the ten categories. 
As we will see, neural network classifiers often yield comparable or better ac- 
curacy and, more importantly, require far less development time compared to 
conventional classifiers. 

The example database used to train and test the network consists of 9298 
isolated numerals digitized from handwritten zip codes that appear on U.S. mail 
envelopes (LeCun et al. 1989). Typical digit examples used for this project are 
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Figure 8.9 Examples of handwritten zip codes (top) and normalized digits from the training / 

test database (bottom). [© MIT Press; reprinted from LeCun et al. (1989) with permission.] 


shown in Figure 8.9. The digits are written by many different people using a great 
variety of sizes, writing styles, instruments, and with a widely varying amount of 
care. It can be noted that many of the digits are poorly formed and are hard to 
classify, even for a human. Of the 7291 sample digits used for training, 2007 have 
been used for test purposes. The captured digits are first normalized to fill an area 
consisting of 40 X 60 black and white pixels. The resulting normalized pattems 
are presented to a neural network after being reduced to 16 X 16 pixel images. 
Further processing is performed by a mix of custom hardware and specialized 
software for image processing. 
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The general, common strategy for handwritten character processing is to 
extract features from the images and then perform the classification based on 
the resultant feature map (Graf, Jackel, and Hubbard 1988). As the first step of 
the recognition sequence, a digital computer equipped with frame-grab hardware 
captures a video image of a handwritten character. The computer thresholds the 
gray-level image into black and white pixels. This stage can be termed “image 
capturing.” Next, the image is scaled both horizontally and vertically to fill a 
16 X 16 pixel block. The 16 X 16 scaled character format appears to be of adequate 
resolution for the set of 10 digits, but would probably be too small to deal with 
letters or more complex characters. 


Recognition Based on Handwritten 
Character Skeletonization 

This section reports on one of the earlier attempts at budding neural network- 
based handwritten character recognizers (Jackel et al. 1988; Graf, Jackel, and 
Hubbard 1988). The image capturing and scaling as described in the preceding 
paragraph is followed by the “skeletonization.” This is a process that makes 
the lines one dimensional by removing meaningless linewidth variations. The 
rationale for line-thinning preprocessing is that a smaller number of features is 
sufficient to analyze the skeletonized images. Since the linewidth does not carry 
much information about the character itself, the skeletonization removes pixels of 
the image until only a backbone of the character remains, and the broad strokes 
are reduced to thin lines. Skeletonization is illustrated in Figure 8.10(a) where 
the gray area represents the original character, the digit 3, and the black area is 
its skeletonized image. 

The skeletonization is implemented through scanning of the 5 X 5 pixel 
window template across the entire image. Twenty different 25-bit window tem- 
plates are stored on the neural network chip. These templates are responsible 
for performing the skeletonization of the images. One of the templates is shown 
in Figure 8.10(b). During scanning each template tests for a particular condition 
that allows the deletion of the middle pixel in the window. If the match of the 
image pixels to any of the templates exceeds a preset threshold value, the center 
pixel of the character bit map is deleted. 

In this manner, the entire 16 X 16 pixel image is scanned, and a decision 
is made for each pixel in the image whether that pixel just makes the line fat 
and can be deleted, or whether its presence is crucial to keep the connectivity of 
the character intact (Graf, Jackel, and Hubbard 1988). The 20 different templates 
for scanning images were crafted by network designers in a systematic but ad 
hoc manner. Examples of a single scanning pass through the image using four 
selected window templates are shown in Figure 8.11. The black pixels deleted 
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(a) 



(Template: black pixels 
are excitatory, grey are 
inhibitory, white are do 
not care connections) 


00 + 000 + + - 0 + + +0 000000 


(b) 


Figure 8.10 Skeletonization of handwritten characters: (a) pixel image of digit 3 and (b) exam- 
ple of one of the window templates used for line thinning. [©IEEE; reprinted from Graf, Jackel, 
and Hubbard (1988) with permission] 


in the images shown as a result of each template pass are marked in gray on the 
four images in each box. 

The skeletonization process is followed by the feature extraction process for 
each character. In the feature extraction stage, the thinned image is presented to a 
number of 7 X 7 feature extracting pixel window templates. The neural network 
chip stores 20 different feature extracting window templates, again chosen ad 
hoc , but inspired by results from experimental neurobiology. Feature extraction 
processing by means of window templates is illustrated in Figure 8.12. The tem- 
plates check for the presence of oriented lines, oriented line ends, and ares in the 
image of a skeletonized character, for example, like the one displayed in Figure 
8.12(a). Examples of the extracted féatures of line endstops and horizontal lines 
are shown in Figures 8.12(b), (c), and (d). Whenever a feature template match 
exceeds the preset threshold, a 1 is i set in a blank feature map for that corre- 
sponding feature. In this way, a featufe map is produced for every template. The 
maps of such features for each evaluated character skeleton are then subjected 
to an “OR” funetion after the scan is completed (Jackel et al. 1988). 

The feature extraction process terminates by mapping the skeletonized input 
pattem of 16 X 16 size image into 20 different 16 X 16 feature images. To 
compress the data, each feature map is then coarse-blocked into a 3 X 3 array. 
As a final result of processing for feature extraction, 20 different 3X3 feature 
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Figure 8.11 The skeletonization process, four window templates and four passes shown. 
(Templates: black pixels are excitatory, gray are inhibitory, white are do-not-care connections; 
digits: gray pixels are deleted by Sach template.) [©IEEE; adapted from Jackel et al. (1988) 
with permission.] 


vectors, consisting jointly of 180 feature entries, are produced. These entries are 
then classified into one of the 10 categories signifying each of the 10 digits. 
Several conventional classification algorithms have been used to determine the 
class membership of the resulting vector. The overall results of this method 
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(a) 



Figure 8.12 Examples of feature extraction: (a) skeletonized image for future extraction, 

(b) and (c) features of line endstops, and (d) features of horizontal lines. (Templates: black pixels 
are excitatory, gray are inhibitory, white are do-not-care connections.) [©IEEE; reprinted from 
Jackel et al. (1988) with permission.] 


for handwritten character recognition have varied between 95% accuracy for 
hastily written digits and 99% accuracy for carefully written digits (Jackel et 
al. 1988). The tests were performed on digits taken again from the U.S. post 
office databases. 

The approach to handwritten digit recognition described in this section is 
based on skeletonization and feature extraction using ad hoc experimentation and 
hand-chosen templates. This concept uses specially developed window templates 
and also a carefully organized scanning process. Through the scanning of suitable 
window templates across the character bit map, the original pattem of a character 
is decomposed into a single large-size feature vector. This is then followed by 
image compression. The sequence of processing is implemented by a neural 
network chip designed specifically fpr this purpose. The final classification of 
the feature vector obtained from eadh character is obtained using conventional 
computation. 
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The reader may have noticed that the approach presented in this section 
involves neither neural leaming techniques nor the known recall algorithms de- 
scribed in the preceding chapters. Rather, the approach is based on analyzing 
the matching of the pattem with the number of movable window templates. The 
matching involves processing of the pattem segment through suitably chosen Win- 
dows that represent weights for the pixels overlapping with the evaluated image. 
As we see, this approach, in which neural network leaming has been replaced by 
using intuitively produced weight matrices associated with each window template, 
can be termed rather unconventional. 


Recognition of Handwritten Characters 
Based on Error Back-propagation Training 

Unlike the network described in the preceding section, the multilayer feed- 
forward neural network for handwritten character recognition can be directly fed 
with images obtained as slightly preprocessed original pattems containing low- 
level information. First, the 40 X 60 pixel binary input pattems are transformed to 
lit into a 16 X 16 pixel frame using a linear transformation. This frame contains 
the target image. The transformation preserves the aspect ratio of the character, 
but the resulting target image consisting of 256 entries is not binary. It has 
multiple gray levels because a variable number of pixels in the original input 
pattem can fali into a given pixel of the target image. The gray levels of each 
resulting target image are then scaled and translated to fail within the bipolar 
neuron’s output value range, —1 to +1 (LeCun et al. 1989). This step is needed 
to utilize fully the mapping property of the activation function of the first layer 
of neurons. 

The remainder of the recognition processing is performed by a rather elab- 
orate multilayer feedforward network trained using the error back-propagation 
technique. The architecture of the network is shown in Figure 8.13. The network 
has three hidden layers, J 1, /2, and /3, and an output layer consisting of 10 
neurons. The hidden layers perform as trainable feature detectors. 

The first hidden layer denoted as /I consists of 12 groups of 64 units per 
group, each arranged in an 8 X 8 square. Each unit in Jl has connections only 
from the 5 X 5 contiguous square of pixels of the original input. The location of 
the 5 X 5 square shifts by two input pixels between neighbors in the hidden layer. 
The motivation for this is that high resolution may be needed to detect a feature 
while the exact position of it is of secondary importance. All 64 units within a 
group have the same 25 weight values so they all detect the same feature. Thus, 
a group in layer Jl serves as a detector of a specific feature across the entire 
input field. However, the units do not share their thresholds. The weight sharing 
and limitations to processing of 5 X 5 receptive input helds has resulted in a 
reduced number of connections equal to 19 968 or (12 X 64) by (25 4- 1), from 
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Figure 8.13 Architecture of multilayer feedforward network for handwritten character recogni- 
tion. [© MIT Press; adapted from LeCun et al. (1989) and reprinted with permission.] 


the = 200 000 that would be needed for a fully connected first bidden layer. Due 
to the weight sharing, only 768 biases and 300 weights remain as free parameters 
of this layer during training. 

The second hidden layer uses a similar set of trainable feature detectors 
consisting of 12 groups of 16 units, again taking input from 5X5 receptive 
helds. A 50% reduction of image size and weight sharing take place in this layer. 
The connection scheme is quite similar to the hrst input layer; however, one 
slight complication is due to the faet that /I consists of multiple two-dimensional 
maps. Inputs to J2 are taken from 8 of the 12 groups in /l. This makes a total 
of 8 X 25 shared weights per single group in J2. Thus, every unit in J2 has 200 
weights and a single bias. Since this J2 layer consists of 192 units, it employs 
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38 592 (192 X 201) connections. All these connections are controlled by only 
2592 free trainable parameters. Twelve feature maps of this layer involve 200 
distinet weights each, plus the total of 192 thresholds. 

Layer J3 has 30 units and is fully connected to the outputs of all units in 
layer J2. The 10 output units are also fully connected to the outputs of layer 
J3. Thus, layer J3 contains 5790 distinet weights consisting of 30 X 192 units 
plus 30 thresholds. The output layer adds another 310 weights, of which 10 are 
thresholds of output neurons. 

The discussed network has been trained using the error back-propagation 
algorithm. The training has involved only 23 full cycles. Thus, each character 
has been presented about 16 769 times to the network under the assumption that 
each digit occurs with the same probability in the training set of 7291 zip code 
digits. The weights were initialized with random values so that the total input to 
each neuron was within the operating range of its sigmoidal activation funetion. 
The target values for the output layer were chosen within the range of the sigmoid 
rather than at extreme ± 1 values, as is usually done for classifier training. This 
prevented the weights from growing indefinitely during the training. The weights 
were updated after the presentation of each single pattem. 

Let us summarize the performance of the trained network of Figure 8.13 
(LeCun et al. 1989). After completion of the training, only 10 digits from the 
entire training set had been misclassified (0.14%), thus yielding the decision 
error E d as in (4.36) with a value 0.137 X 10~ 6 . However, 102 mistakes (5.0%) 
were reported on the test set and the decision error rose to E d = 0.0517. The 
convergence to the correct response was found to be rather quick during the 
simulation of the test performance. 

In a more realistic performance evaluation test, the rejection of marginal 
characters has been implemented that are very difficult to classify even for a 
human. In this approach, marginal characters are exeluded from testing. The 
character rejection from the test set has been based on the criterion that the 
difference between the two most active output levels should exceed a certain 
minimum value to ensure reasonably reliable classification. To lower the error of 
5% on the full test set to 1% using the rejection concept, the scribbled 12.1% of 
test pattems had to be removed because of their similarity to the classification 
of other digits. 

Other networks with fewer feature mapping levels were also evaluated by 
LeCun et al. (1989) but they produced inferior results. For example, a fully con- 
nected network of 40 hidden units in a single layer was not able to generalize 
as well, compared to the architecture described. Misclassifications measured 1.6 
and 8.1% on the training set and test set, respectively. In addition, 19.4% of the 
test set characters had to be removed to achieve the error of 1% when using the 
marginal character rejection concept. 

It can be seen that the error back-propagation algorithm yielded the results 
that appear to bé the state-of-the-art situation for handwritten digit recognition. 



8.3 NEURAL NETWORKS CONTROL APPLICATIONS 


485 


The trained network has many aetual connections but only relatively few free pa- 
rameters are selected during training. Also, distinet feature extraction constraints 
were imposed on the network. The training was implemented on commercial dig- 
ital signal processing hardware that efficiently performed multiply/accumulate 
operations. The overall throughput rate of the complete, trained digit recog- 
nizer, including the image acquisition time, is reported to be 10 to 12 digit 
classifications per second. 


8.3 ii ii 

~ NEURAL NETWORKS 
CONTROL APPLICATIONS 

The use of neural networks in control applications — including process Con- 
trol, robotics, industrial manufacturing and aerospace applications, among 
others — has recently experienced rapid growth. The basic objective of control is 
to provide the appropriate input signal to a given physical process to yield its 
desired response. In control systems theory, the physical process to be controlled 
is often referred to as the plant. The plant input signals, called actuating signals, 
are typically needed to control motors, switches, or other actuating devices. If 
the actuating signal at the input to the plant shown in Figure 8.14 is generated 
by the neural network-based controller, the case can be termed neural control , 
or, briefly, neurocontrol. This section covers the basic concepts and examples of 
neural control techniques. Those neurocontrol aspects that deal specifically with 
robotics applications are discussed in the subsequent section. 


OverView of Control Systems Concepts 

This section presents an overview of the elementary concepts and terminol- 
ogy of control systems. The scope of presentation is limited only to issues that 
are later related in the context of neural controllers. Specifically, the concepts 
of transfer characteristics, transfer funetions, and feedforward and feedback con- 
trol are reviewed below. Readers familiar with such introductory material may 
proceed direetly to the next section. 

Solution of the control problem depends direetly on the precision with which 
characteristics of the process are known. In a single-channel control problem 
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Figure 8.14 Illustration of control problem. 
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Figure 8.15 Illustration of static and dynamical plant transfer function: (a) electric one-port 
element as a plant, (b) resistor, and (c) capacitor. 


when the plant input is a scalar variable, for example, the process, if known, 
can be described by a relationship between the plant actuating signal and the 
plant response. In general, this relationship is expressed as P, where P is the 
process transfer function in the form P(s ), and s is the complex variable. We 
consider only time-invariant plant characteristics in this section. In such cases, 
P(s ) includes the dynamic properties that relate the input and output signals of 
the plant. For plants that can be described in the time domain without a time 
variable, P(s) is independent of s and reduces to the simple functional relationship 
between the actuating signal and the response. Such plants are called nondynamic, 
memoryless, or simply static. For static plants the transfer function P(s) becomes 
P and is called the transfer characteristics . 

Let us consider a simple example of static and dynamical plants. Assume 
that the current through a one-port electric element is the plant output, which 
needs to be controlled. The voltage across this one-port element is considered to 
be the plant’s input, or actuating signal. The plant is shown in Figure 8.15(a). 
The figure also illustrates the equivalence between a specific electric one-port 
element and the commonly used general block diagram representation for the 
plant. 

Since the plant transfer characteristics is defined as the ratio of its response 
to the excitation, we see that the description of a nondynamic plant yields a 
simple functional relationship. Now assume that the one-port plant is simply a 
resistor as shown in Figure 8.15(b). The plant transfer characteristic reduces to 
the following constant 



Thus, a static plant can be described by the transfer characteristics P. In the 
case of a static and linear process chosen here as an example (the resistor), the 
transfer characteristics is simply equal to the resistor’s conductance, G. 

Dynamical plants such as the capacitor shown in Figure 8. 15 (c) cannot be 
characterized by constants, not even by complex time-invariant functions because 
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their responses, being solutions of differential or integrodifferential equations, are 
functions of time. The transfer function for such plants is defined as the ratio of 
the integral transforms of the output to the input signal. The Laplace transform, 
denoted here as <!£{•} is the most commonly used transform to express the input 
and output signals of linear plants. Applying Laplace transform theory leads to 
the following transfer function for the capacitance plant of Figure 8. 15 (c): 

I (s) 

w = (8.11b) 

where 

I(s) = 

V(s) = æ{v(t)} 

denote transforms of current through the capacitor and voltage across it, respec- 
tively. Thus, the transfer function as in (8.11b) can be understood to be a ratio 
of the Laplace transform of current to the Laplace transform of voltage. 

The reader may have noticed that the ratio of output response to the input 
signal yields the transfer characteristics of a static plant. Similarly, the ratios of 
the transforms of output to input yield the transfer function of a dynamical plant. 
We also should realize that the concept of a transfer characteristics is of no use 
for dynamical plants. 

The significance of the transfer characteristics in system and control theory 
is based on the faet that response of a static system to an arbitrary excitation 
can be computed as a product of its transfer characteristics and of the excitation. 
Similarly, the transform of the response of a dynamical system can be expressed 
as a product of the transfer function and transform of the excitation. 

For a plant that is completely known, any desired relationship between in- 
put and response can be realized by a simple open-loop control configuration as 
shown in Figure 8.16(a) (Mishkin and Braun 1961). The feedforward controller 
of Figure 8.16(a) consists of two parts connected in cascade. The second one 
cancels the process transfer function because its own transfer function has been 
chosen as the inverse of the known process transfer function. In addition, the 
front part of the controller introduces the desired overall transfer function of 
the controller and the plant equal to D. Thus, the controller ’s transfer function 
should be equal to D/P. This approach is known as inverse control since the most 
essential part of the controller should provide the inverse of the plant ’s transfer 
function. The transfer function of the inverse feedforward control system from 
Figure 8.16(a) computed between its input and output becomes 

Y(s) 

W) =D{S) (8 - 12a) 

In this approach involving the cascade connection of the controller driving the 
plant and involving no feedback, the neural controller that acts as the plant 
inverse needs to be designed. 
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Figure 8.16 Control of process P to achieve the desired response: (a) inverse feedforward 
control and (b) feedback control. 


Realize, however, that the inverse control method fails in any of the two 
important circumstances: (1) when the transfer function P(s) or the transfer char- 
acteristics P varies during the course of normal operation of the plant, and (2) 
when P(s) or P is not completély known in advance of design. We customar- 
ily tum to the feedback control systems to circumvent any or both of the two 
limitations mentioned. One of the most common feedback control arrangements 
is shown in Figure 8.16(b). That such a configuration is directly useful if P(s) 
or P varies or is unknown is clear from simple analytical considerations. The 
feedback control system of Figure 8.16(b) is described with the transfer function 
between input and output 


m = GQ)/>Q) 

X(s) 1 + G(s)P(s)/D(s) 


(8.12b) 


Note that the system designer can choose GO) and D(s). If G(s)/D(s) is made 
sufficiently large so that a quantity known as the loop-gain and denoted LG 
satisfies the condition 


A G(s)P(s) 
D 0) 


> 1 


(8.13a) 


I s—j<o 
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for all frequencies w of interest, then the transfer function (8.12b) reduces to 
the form approximately equal to (8.12a). Using the condition (8.13a), we easily 
obtain from (8.12b) 

^ ^ D(s) (8.13b) 

X(s ) 

This result indicates that for high and fixed loop-gain values, the output now 
becomes independent of the plant transfer function P(s ), as desired. Therefore, 
the use of feedback provides a useful means for overcoming both the param- 
eter variation of the plant and the ignorance of the designer as to the exact 
nature of the plant and its transfer function P(s). These observations also apply 
to nondynamic plants. However, as mentioned earlier, for nondynamic plants, 
transfer functions G (s), P(s ), and D(s) in (8.12) and (8.13) become independent 
of complex variable s . 

In this section we will approach the design of the controllers using the 
estimated information about the plant. We will assume that such information is 
gradually acquired during plant operation. Such Control systems may be called 
learning control systems. Neural networks fit quite well into such a framework 
of control systems study. Plant identification techniques that make use of neural 
network techniques are reviewed first. Neural network applications for control of 
a nondynamic plant are then discussed. An example of a perceptron-based con- 
troller for a low-order dynamical plant will also be designed. The input signal to 
the perceptron controlling the dynamical plant, however, contains not only the 
present output of the plant but also its most recent output preceding the time 
the control decision was made. Finally, the discrete-time system control of the 
dynamical plant using the neurocontrol concept is studied. The concept involves 
adaptive control similar to the cerebellar model articulation controller (CMAC). 


Process Identification 

The introductory discussion above points out that plant identification can be 
distinctly helpful in achieving the desired output signal of the plant. In general, 
identification of the plant should result in usable terms such as coefficients of 
a plant differential equation or coefficients of its transfer function. The issue 
of identification is perhaps of even greater importance in the field of adaptive 
control systems. Since the plant in an adaptive control system vanes in operation 
with time, the adaptive control must be adjusted to account for the plant varia- 
tions. Neural networks used for identification purposes in this section typically 
have multilayer feedforward architectures and are trained using the error back- 
propagation technique. 

The basic configuration for forward plant identification is shown in Figure 
8.17(a). A nonlinear neural network receives the same input x as the plant, and 
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(a) (b) 


Figure 8.1 7a, b Neural network configuration for plant Identification: (a) forward plant identifica- 
tion and (b) plant inverse Identification. 


the plant output provides the desired response d during training. The purpose 
of the Identification is to find the neural network with response o that matches 
the response y of the plant for a given set of inputs x. During the identification, 
the norm of the error vector, ||d - o||, is minimized through a number of weight 
adjustments. The minimization is achieved using leaming techniques discussed in 
previous chapters. Figure 8.17(a) shows the case for which the network attempts 
to model the mapping of plant input to output, with both input and output mea- 
sured at the same time. This is understandable since the trained neural network is 
feedforward and instantaneous; thus, we have o(f) = \(t). However, more com- 
plex types of models for plant identification can also be employed. To account 
for plant dynamics, the input to the neural network may consist of various past 
inputs to the plant (Barto 1990). These can be produced by inserting the delay 
elements at the plant input. 

In earlier discussions, we leamed that the identification of the plant inverse 
may offer another viable alternative for designing the control system. In contrast 
to forward plant characteristics identification, now the plant output y is used as 
neural network input, as shown in Figure 8.17(b). The error vector for network 
training is computed as a x — o, where x is the plant input. The norm of the 
error vector to be minimized through leaming is therefore ||x — o||. The neural 
network trained this way will implement the mapping of the plant inverse. Once 
the network has been successfully trained to mimic the plant inverse, it can be 
used directly for inverse feedforward control as illustrated in Figure 8.16(a). 
The properly trained neural network acts as controller in this configuration, 
and the appropriate actuating signal for the plant is directly available at the 
network ’s output. 

In both identification cases described, the neural network is trained to copy 
either the forward or inverse characteristics of the plant. Typically, multilayer 
feedforward networks are used for this purpose. A trained network as shown in 
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Figure 8.17c Neural network configuration for plant Identification (continued): (c) trained neural 
network architecture. 


Figure 8. 17 (c) can model the behavior of the actual plant in the forward or in- 
verse mode of operation. Each of the identification modes has its advantages and 
disadvantages. Forward plant identification is always feasible; however, it does 
not immediately allow for construction of the plant controller. In contrast, plant 
inverse identification facilitates simple plant control; however, the identification 
itself is not always feasible. 

A major problem with plant inverse identification arises when the plant 
inverse is not uniquely defined. This occurs for a plant when, as in Figure 8.18, 
more than one value of x exists that corresponds to one value of y. This partic- 
ular condition coincides with the nonexistence of unique inverse mapping of y 
into x. Figure 8.18 illustrates this limitation of the plant inverse identification for 
the one-dimensional case. In the discussed case, the neural network modeling the 
plant inverse attempts to map a single input y* to one of the two target responses 
x x or x 2 - It may be that the eventual mapping leamed would somewhat tend to 
average the two desired x values, but in no case can such identification of the 
plant inverse be considered adequate. 

As an example of plant inverse identification, consider a static plant that 
converts polar (r, 6) to Cartesian (x, y) coordinates during its forward operation. 
The trained neural network should convert Cartesian to polar coordinates. A 
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Figure 8.18 Plant inverse identification example: (a) plant inverse characteristics exists and 
(b) plant inverse characteristics does not exist. 


two-layer feedforward architecture is chosen with two active inputs plus single 
bias input, 10 hidden-layer neurons, plus a single hidden biasing neuron, and 
two output neurons (Psaltis, Sideris, and Yamamura 1987, 1988). Hidden-layer 
neurons with unipolar activation function are selected with A = 1. Two output 
neurons are chosen with the linear activation function f(net) = net of unity slope 
so that they provide an unlimited range of output values. Initial weight values 
are chosen randomly at ±0.5, and the leaming constant used for training is 0.1. 

The simulated leaming experiment of plant inverse identification involves 
10 points spread within the first quadrant such that r e [0, 10], and 9 e [0, 7r/2] 
(Psaltis, Sideris, and Yamamura 1987). The training set points are marked by 
circled g letters on Figure 8.19(a). The diagram shows the contour map of map- 
ping error ||x — o|| 2 , in Cartesian coordinates, before the training. Figure 8.19(b) 
shows the contour map of the error after 100 leaming cycles. It can be seen that 
the conversion error has been reduced at and around the training points. Figure 
8.19(c) indicates that the error after 1000 cycles has been totally suppressed at 
the training points. Continued training, however, gives rise to a problem shown 
in Figure 8.19(d). After 10 000 training cycles high-error pockets were observed 
outside the training areas. This phenomenon points out that the generalization 
ability of the network has actually decreased due to excessive training. Poor 
generalization also indicates that either the number of the training points is too 
low, or that the chosen architecture is not adequate for the problem requirements. 

Let us note that the described training experiment is very similar to the 
one covered in Example 4.3. Indeed, the plant identification problem discussed 
here corresponds to a two-variable function approximation leaming. The accurate 
values of the function to be approximated are outputs produced by the plant to 
be identified. These outputs are desired responses of the multilayer feedforward 
network, which undergoes the training. 



This description of identification procedures and example of plant iden- 
tification overlook many details of real and complex identification problems. 
In particular, it does not completely cover the identification of the dynamical 
systems. It presents, however, the basic guidelines for plant modeling, which 
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has many uses in signal processing and control. One usually assumes here that 
the model has a general mathematical form, and then its unknown parameters 
are estimated. Forward and inverse plant identification configurations seem to be 
somewhat natural applications for multilayer feedforward neural network models. 
The plant parameter estimation is implemented using the pairwise arranged ex- 
amples of plant input /output values. The role of the teacher in plant identification 
is played by the system being modeled. 


Basic Nondynamic Learning 
Control Architectures 

Let us review a specific method of control in which multilayer neural 
networks are used as controllers of nondynamic plants (Psaltis, Sideris, and Ya- 
mamura 1987, 1988). The following discussion is closely related to the plant 
identification techniques discussed in the preceding section. The training of the 
neural network-based controller is also discussed in order to demonstrate the 
feasibility of the proposed architecture. 

Figure 8.20 shows the feedforward controller implemented using a neural 
network. Neurocontroller B is an exact copy of neural network A, which under- 
goes training. Network A is connected in such a way that it gradually leams to 
perform as the unknown plant inverse. Thus, this configuration is implement- 
ing the inverse feedforward control similar to Figure 8.16(a). The input of the 
controller is d, which is the desired response of the plant. The plant’s actual 
response is y due to its input x produced by the neurocontroller B. The error 
used for training the neural network is the difference between the output signals 
of networks A and B. Although the network B tracks A after each training step, 
y must exactly match d for this error to reduce to zero. We may note that this 
configuration becomes useless if the plant inverse is not uniquely defined, as 
illustrated by an example in Figure 8.18. 



Figure 8.20 Feedforward control with plant inverse learning. 
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(b) 


Figure 8.21 Specialized on-line learning control architecture: (a) static plant and (b) dynamical 
plant. 


This control architecture is also called an indirect learning architecture. The 
distinet advantages of this configuration are obvious. The controller network can 
be trained on-line since it undergoes training while its copy performs the useful 
control funetion work. Thus, separate training sessions are no longer needed. 
Moreover, the inputs to the neurocontroller are the desired outputs of the plant. 
Thus, the controller ’s training can easily be performed in the region of interest 
of the output vector domain. Such training in the region of interest can be called 
specialized training. In addition, the neural network leams continually and is 
therefore adaptive. 

A closely related control architecture for control and simultaneous special- 
ized learning of the output domain of the static plant is shown in Figure 8.21(a). 
This control mode possesses all advantages of the control mode displayed in 
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Region 2, Fig. 8.19 error 



Inverse plant identification (I) 

Specialized leaming (S) 

IOxI + S 

Figure 8.22 Leaming profiles for leaming architectures from Figures 8.17 and 8.21. 
[©IEEE; adapted from Psaltis, Sideris, and Yamamura (1988) with permission.] 


Figure 8.20, but it uses only a single copy of the neural network instead of the 
two required by the control system of Figure 8.20. 

The control objective of the specialized leaming control architecture from 
Figure 8.21(a) is to obtain the desired output d of the plant under the condition 
that the plant input causing this output is unknown. For a dynamical plant, the 
trained neural network needs to be provided the sequence of recent output plant 
vectors in addition to present plant output. This would allow for reconstruction 
of the current State of the plant based on its most recent output history. Figure 
8.2 l(b) shows a specialized leaming control architecture of the dynamical plant 
of order k. In this case, the last k output vectors of the controlled plant are 
augmenting the present input of the trained neural network. It should be under- 
stood that the delay units A placed in the feedback loops denote the time elapsed 
between the samples of inputs used for training, and that the plant is controlled 
in real time. Under such conditions, k can be referred to as the order of the 
dynamical plant (Saerens and Soquet 1991). 

Figure 8.19 illustrates three rectangular regions of specialized leaming se- 
lected for the static neural network trained using the configuration shown in 
Figure 8.2 l(a). Ten points have been used within each rectangular leaming spe- 
cialization region as shown at the bottom of Figure 8.19. The leaming profiles 
based on error |}d — y|| are shown in Figure 8.22 (Psaltis, Sideris, and Yamamura 
1988). The error analyzed and shown in the figure has been averaged for 10 
points in region 2. The figure shows that errors for the plant inverse identification 
method (continuous line) and specialized leaming (dashed line) are comparable. 
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Figure 8.23 Back-propagating error information through the plant. 


In addition, the leaming experiment consisting of 10 cycles of plant inverse 
Identification preceding the specialized leaming has been carried out. The results 
have shown that this hybrid leaming method may have a distinet advantage 
over each of the two leaming methods applied separately. Combining the two 
leaming approaches has resulted in the lowest error compared to other leaming 
architectures. However, even with a series of extensive simulated experiments, 
it has not been possible to determine conditions under which the two leaming 
methods have to be combined to achieve the best leaming results. 

The specialized on-line leaming control architecture of Figure 8.21 suffers 
from one major weakness, which may result in slower training or other undesir- 
able characteristics if the plant is known only approximately. We cannot apply 
the error back-propagation training for this configuration direetly because only 
the error between the accessible plant output y and its desired output d is known 
to the user. The error d — y must be reduced to the input of the plant first. 

The neural network to be trained using the error available at the plant 
output as shown in Figure 8.21 has been drawn in more detail in Figure 8.23. 
The customary error back-propagation rule (4.5a) for weight adjustment within 
the neural network can be rewritten in the form involving neural network outputs 
o; as follows: 

dwn = for i = 1, 2 n, j = 1, 2, . . . , m (8.14) 

1 døj dw^ 

where the plant output error E explicitly available from the teacher can be ex- 
pressed as 

1 m 

E=r X(4-y,) 2 

2 ,=i 


(8.15) 
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To compute the error E reflected at the plant’s input we assume that the plant is 
described with the following algebraic equations: 

y t = for i = 1, 2, . . . , m (8.16) 


The output error E propagates back to the input of the plant and SE / do h i = 1, 
2, . .., n, can be computed from (8.14), (8.15), and (8.16) as follows: 


do i do i do i do i 


dE „ , „ dy, , , „ dy 2 

' t - (d\ — + (d 2 — + 

o (?2 d #2 o 6>2 


i / j \ dy m 


(8.17) 




The expressions (8.17) can be rewritten in matrix form as below. 

- v ” £ *0' (d - rt 


where 


and 


f dE 



do j 
dE 

do 2 


dE_ 

, do n 


(8.18) 




for i = 1, 2, 


n, j = 1, 2, w 


is the m X n Jacobian matrix of the plant. It can be seen from (8.18) that the 
plant output error d — y propagates back through the plant according to the 
transpose of the plant’s Jacobian matrix. This matrix describes the performance 
of the linearized plant (8.16) taken at the plant’s operating point. 

If the plant equations (8.16) are known, the Jacobian matrix can be easily 
produced in an analytical form at the plant’s operating point. If the plant is not 
known, the array of partial derivatives dy, / doj can be determined by deviating 
the plant’s input j by the amount A Oj from its operating point and then measuring 
resulting changes A y u A y 2 , ...» A y m . This approach is called the perturbation 
method and allows for measurement or simulation of approximate values of the 
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Jacobian matrix entries. The perturbation method yields approximate values only 
of dy i / doj since it involves finite differences A y : and Aoj rather than infinitesimal 
differentials. However, the method is always applicable. Also, the evaluation of 
Jacobian matrix entries can be performed using the perturbation method either 
computationally or experimentally. 


Inverted Pendulum Neurocontroller 

In this section we will review the capability of a simple neural network to 
leam how to balance an inverted pendulum. The inverted pendulum (also called 
“broomstick balancing”) problem is a classical control problem that has been 
extensively studied by many researchers and seems to be well understood. Also, 
the problem is representative of many typical dynamical plant control problems. 
Thus, designing the controller for this inherently unstable system will allow us 
to approach and solve a number of other, similar control tasks. The complexity 
of the task of inverted pendulum balancing is significant enough to make the 
problem interesting while still being simple enough to make it computationally 
tractable (Tolat and Widrow 1988). 

A number of control techniques using neural networks to balance an inverted 
pendulum have been investigated (Barto, Sutton, and Anderson 1983; Anderson 
1989). In the approach presented below we will use the simple and synergistic 
approach, which examines the merger of visual image acquisition, training, and 
neurocontrol (Tolat and Widrow 1988). Since the plant is inherently dynamical, 
the control must involve the plant’s present and past State information. 

The goal of the inverted pendulum task is to apply a sequence of right 
and left forces of fixed magnitude such that the pendulum is balanced, and the 
cart does not hit the edge of the track. The cart-pendulum physical system is 
shown in Figure 8.24(a). The cart and pendulum are constrained to move only 
within the vertical plane. Four time variables describe the State of the system: the 
horizontal position and velocity of the cart (x, v), the angle between the pendulum 
and vertical, and the angular velocity of the falling pendulum (0, o>). The force 
required to stabilize the system is a function of four State variables x, v, 0 , and 
o) as follows: 

F(t) = ax(t) + bx(t) 4- c6(t) + dé(t ), ( 8 . 19 ) 

where x(t) = v(t), 6(t) = a>(0 

and a, b, c, and d are constant coefficients, which need to be found for each 
system since they are dependent on its physical characteristics. 

Equation (8.19) approximates the motion of the system and assumes that it 
can be characterized by linear differential equations. Knowing the coefficients a, 
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x - cart position 
x - cart velocity 




Figure 8.24 The inverted pendulum illustration: (a) physical system to be controlled and 
(b) two examples of computer-generated system images: left — real image; right — quantized 
image. [©IEEE; adapted from Tolat and Widrow (1988) with permission.] 


b , c , and d and solving the system’s modeling equation (8.19) would allow the 
user to compute the force that balances the plant. It is implied here, however, 
that neither Eq. (8.19) nor its coefficients are known both for the controller and 
for the observer. Let us consider that the only information available is about the 
order of equation and its general form. Since the equation is of first order, it can 
be expected that providing the present and most recent plant output value would 
enable us to train the controller adequately. As a result, the controller would 
need to generate the positive or negative force that would possibly stabilize the 
pendulum at its vertical position. 

Since the balancing force can be assumed to be constant and of fixed mag- 
nitude for each push-pull action, and it differs only by direction, it may be 
postulated that the following force will stabilize the solution of the differential 
equation (8.19): 


F(t) = asgn 


ax{t) 4- bx{t) 4- c6(t ) 4- dé(t) 


( 8 . 20 ) 


where a is a positive constant representing the force magnitude. Though the 
cart and pendulum velocities are involved in (8.20), they are more difficult to 
measure than linear distances and angles. To circumvent this difficulty, Equation 
(8.20) can be rewritten in the form of a difference equation: 


F(t) = a sgn 


ia 4* b)x(t) - bx(t — 1) + (c 4- d)0(t) — ddit - 1) 


( 8 . 21 ) 


This equation approximates the value of the balancing force F(t) at the present 
time. It is assumed here that x(t— 1) and 0(t— 1) are cart positions and pendulum 
angles at an earlier time preceding the evaluation of the plant and the computation 
of force. 
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The dynamics of the system remains largely unknown during the process 
of neurocontroller design. The knowledge of the system’s dynamics is limited 
to the vague form of the difference equation (8.21) and it does not include its 
coefficients. In addition, the human observer who would measure cart positions 
and angles of pendulum at every instant is replaced by the input of the visual 
image of the cart and pendulum. Since it is essential to provide the controller 
with the present and most recent position and angle data, the present and most 
recent images arranged in pairs have been used to train the network (Tolat and 
Widrow 1988). 

The neural network training was implemented by software simulation with 
the input of binary images providing the required State information x(t), x(t — 1), 
Q(t ), and 6(t - 1). Examples of the quantized binary images are shown in Figure 
8.24(b). The images were generated on a Macintosh computer in the form of 
5X11 pixel maps of the cart and pendulum. Images of finer granularity than 
5 X 11 pixels were not used since they required extra computational effort and 
resulted in no improvement in controller performance. Pictures of smaller size, 
in contrast, did not provide an adequate resolution and input representation for 
the solution of this problem. 

For the image resolution used and shown in Figure 8.24(b), there were 65 
different images involving all possible combinations of 5 distinet cart positions 
with 13 different pendulum angles. Since the training objective was to produce 
F(t) as in (8.21), both the present image and the most recent one representing the 
cart with pendulum at an earlier time, t— 1, were used. The time elapsed between 
two consecutive State recordings, equal to the time between two successive force 
applications was 100 ms. This time and graphics resolution were found to be 
sufficient to balance the actual mechanical system investigated. With the two 
images completely combined, there were 65 2 = 4225 training vectors in the 
training set. Each input vector used for training consisted of 2X5X11 = 110 
binary entries, not including the bias input. 

Let us note how this method of controller design differs from more tradi- 
tional control approaches. First, the plant ’s modeling equation (8.19) would have 
to be known to design a conventional controller. Specifically, the task would 
require finding initially unknown coefficients a, fe, c, and d in (8.19) or (8.21). 
Analytical computation of force F(t) would further require solution of the differ- 
ential equation (8.19). Altematively, the difference equation (8.21) obtained from 
(8.19) could be solved by the controller to approximate the value of F(t) needed 
to provide the balance of plant. 

The solution of (8.21) in the described experiment has been obtained without 
finding coefficients a , fe, c, and d, and even without solving any equations, or 
measuring position and velocity of the cart, or even measuring the angle or 
angular velocity of the pendulum. Experiments have shown that a single discrete 
perception with 110 weights and one threshold weight can be trained to find 
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the solution of (8.21). To train the perceptron, 4225 examples of image pairs 
were submitted along with the teacher’s decision as to the desired F(t) value for 
the specific set of observations x(t), x(t — 1), 0(0, and 0(t — 1). The teacher’s 
input determines the sign of force (F = 1 or F = -1) needed to prevent the 
pendulum from falling. This input is based on the study of the present and most 
recent image. Note that both the present and most recent images on x and 0 
are presented as input pattems from the training set. Supervised training has 
replaced solving the system of 4225 linear equations with 111 unknowns. This 
has eliminated finding the best fit of unknowns in this overdetermined system 
of equations. 

In the described control design experiment, the Widrow-Hoff training al- 
gorithm has been used to train the weights of the single perceptron controller 
consisting of 111 weights, one summing node, and one TLU. In addition, the 
network has proven capable of extracting necessary State information from crude 
visual images arranged in a time sequence without the need for measurements 
of motion coordinates or physical parameters of the system. 

The leaming profile for computation of the applied force in the experiment is 
shown in Figure 8.25(a). The error of force sign drops dramatically after several 
passes through the entire training set; then it hattens out. The reported error after 
1000 training cycles was 3.4%. The example of final weight values coded into 
pixel map shade and intensity is shown in Figure 8.25(b). The sign of force F is 
found by combining the pixel images of the system with weights and creating an 
appropriate scalar product of the two 110-tuple vectors plus the threshold value. 

In the next series of simulation experiments with inverted pendulum bal- 
ancing, the training focused on analyzing the generalization capabilities of the 
trained controller. Only a fraction of the total of 4225 images was selected at 
random for training purposes. The training subsets were of different sizes and 
varied from 10 to 100%. Results were computed for four different training sets 
and then averaged. The final error after 300 training cycles is plotted in Figure 
8.25(c) for six different training set sizes. The final error values indicate that the 
trained network is able to generalize fairly uniformly and well for training sets 
consisting of more than 40% of the pattems of the complete set. 

The neurocontroller designed for keeping the pendulum from falling was 
able to perform successfully despite occasional failures. A single discrete percep- 
tron was proven to be adequate for this task. More complicated control problems, 
however, will require larger neural networks with appropriate inputs (Fu 1986). 
The network used in this case has not made use of explicit decision, or control, 
mies. The only mies used were common sense teacher guidelines as to how to 
keep the pendulum vertical. The example remarkably demonstrates that a teacher- 
machine interaction can be gradually replaced by observing the environment and 
providing responses suitable for the observations. After training, the controller 
should be able to perform the control task autonomously by responding correctly 
to most situations, even those that were not specifically used for training. 
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Figure 8.25 Training of the cart-pendulum controller: (a) leaming profile example, (b) image of 
controller’s final weights, 4225 images used, and (c) analysis of generalization capability of the 
trained controller. [©IEEE; adapted from Tolat and Widrow (1988) with permission.] 
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Cerebellar Model Articulation Controller 

This section discusses the cerebellar model articulation controller (CMAC). 
Both the concept and the application case study for low-order CMAC control 
problems are reviewed here based on work by Kraft and Campagna (1990a, 
1990b). The basic idea of the CMAC approach is to leam an approximation of 
the system characteristics and then to use it to generate the appropriate control 
signal. The approximation of the system characteristics is understood as gradual 
leaming based on the observations of the plant input-output data in real time. 
The leaming of a CMAC controller proceeds somewhat like the winner-take-all 
competitive leaming discussed in Sections 2.5 and 7.2, however, without the 
competition enforcing the weight adjustment. 

Let us consider the case of a first-order plant that needs to be controlled to 
yield a known, desired output signal. A sample discrete-time first-order plant is 
described with the difference equation 

y(k + 1) = ay(k) + bx(k) (8.22a) 

where y(k) is the system output, x(k) is the input, k is the time step number, and 
a and b are constant coefficients. Each point in plane y(k), y(k + 1) corresponds 
to a required value of control input x(k). Interpreting the present and the next 
following output y(k), and y(k + 1), respectively, as two independent variables, 
input x(k) can be represented as a so-called characteristic surface of the plant. 

Two examples of characteristic surfaces are illustrated in Figure 8.26. For 
the linear plant as in (8.22a) with a = 6 = 1, the planar characteristic surface is 
shown in Figure 8.26(a). Nonlinear systems exhibit more diversified characteristic 
surface shapes that cannot be expressed through a linear form as in (8.22a). A 
sample nonplanar surface for the equation 

y(k + 1) = y(k) + [jt(*)] 1/3 (8.22b) 

is shown in Figure 8.26(b). Obviously, both for the linear or nonlinear plant that 
is completely known, the characteristic surface can be calculated and used to 
produce the present control input value x(k). A simple look-up table or formulas 
such as (8.22) for x{k) values at each coordinate pair y(k ), y(k + 1) would pro- 
vide, without computation, the appropriate value of the desired control signal. In 
a more realistic situation, however, when the system parameters are not known 
in advance or they vary in time, the characteristic surface can be leamed grad- 
ually from the plant input-output data. This can be accomplished in an on-line 
control mode. 

The block diagram of the complete leaming and control system is shown in 
Figure 8.27. Let us assume a first-order discrete-time plant. For such a plant, the 
control signal x(k) is determined by a pair of values y(k) and y(k +1), which are 
the present and next plant outputs, respectively. The sequence of values y(k) 9 for 



r ts* 


8.3 NEURAL NETWORKS CONTROL APPLICATIONS 


505 


Discret© In te g rotor No n l l n e a r Sy s lønn 
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Figure 8.26 Examples of plant characteristic surfaces: (a) planar surface (8.22a), a = b = 1 
(first-order linear plant) and (b) nonplanar surface (8.22b) (nonlinear dynamical plant). 


k = 1, 2, 3, ...» is assumed to be known and is provided by the reference plant 
output generator. 

The plant Control signal x(k) is initially produced by a conventional con- 
troller. The controller is used in a negative feedback configuration similar to 
the one shown in Figure 8.16(b). The reference plant output generator inputs 
the present output value of the plant to the summing node 51. This input is 
approximately equal to the actual output of the plant due to the faet that the 
closed-loop system between the reference plant output to the summing node 51 
and the plant output has a transfer funetion of approximately unity value. For 
this condition to be valid, the closed-loop control system consisting of classical 
controller, plant, and node 51 must be both stable and it must use a high-gain 
classical controller. Note also that the neurocontroller provides little or no input 
to the summing node 52, at least at the initial training stage of the entire control 
system. 

In the initial stage of operation, the plant derives almost all of its input from 
the classical controller. The neurocontroller, called here the CMAC memory, is 
not yet trained, and it initially outputs a zero value to 52. Therefore, initially, the 
classical controller output is the dominant component of the total actuating signal 
x(k). In each control step, however, the control signal produced by the classical 
controller, x(k) 9 is used to build the value of the CMAC characteristic surface 
x(k) having current coordinates y(k), y(k 4- 1). The characteristic surface adapts 
in each control step according to the following leaming law, which is similar to 
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Figure 8.27 CMAC neurocontrol block diagram. 


the leaming rule (2.46): 

w(k + 1) = w(k) + a[x(k) — w(k)] ( 8 . 23 ) 

In this weight adaptation formula, a > 0 is a small positive leaming con- 
stant, k denotes the leaming (update) step number and also the number of the 
control step. As pointed out, the initial weight values are set to zero, which 
corresponds to setting the entire characteristic surface to zero at the beginning of 
training. The weights w(k), which are built by the conventional controller, can be 
interpreted as iteratively updated control signal values x(£), or memorized values 
of the characteristic surface that needs to be leamed. The leaming as in (8.23) 
should eventually end up at leamed values w, which are averaged x(k) values 
over a number of CMAC controller leaming steps. 

Despite the lack of competition between the neurons and the absence of the 
winner, the leaming as in (8.23) follows the unsupervised winner-take-ali leam- 
ing rule. This is to provide matching between vectors y(k), y(k +1) and vectors 
x(k). The winning neuron during leaming is always the one designated by the 
coordinates y(k ), y(k + 1), of the leamed characteristic surface. It is seen that 
CMAC characteristic surface leaming represents an interesting case of supervised 
look-up memory training. 

The described leaming is of the local type, and more precisely can be called 
pointwise, since the only coordinates at which leaming occurs are y(k ), y(k + 1). 
No information is conveyed to nearby memory locations. The generalization of 
leaming can be used, however, in order to update a neighborhood of the point 
to be leamed. To accomplish the leaming with neighborhood sensitivity, which 
can be also termed as leaming with embedded generalization, the memory values 
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After 

leaming step 


(b) 


Figure 8.28 Leaming of CMAC neurocontroller with generalization: (a) exampie of neighbor- 
hood leaming radius of 2 and (b) leaming at cross-section y*(k + 1) with generalization (two 
cases). 


w(k + 1) are assigned planar coordinates so that they become w t j(k + 1). It is 
now assumed that the memory has its values built over the discrete space with 
a certain neighborhood radius. 

Figure 8.28(a) shows neighborhood sensitivity in the shaded region of gen- 
eralized leaming at point A with the neighborhood radius of 2. Leaming at A in 
such cases and according to (8.23) is accompanied by the neighborhood leaming 
at 24 adjacent memory locations falling within the shaded region. Leaming at 
memory cross-section y*(k + 1) is displayed in Figure 8.28(b). The figure shows 
exampie memory adjustments at five neighborhood points. Curve (1) shows ad- 
justed memory values with equal adjustments for each of the updated locations 
independent of their distance from A. Curve (2) exemplifies tapered leaming with 
adjustments of w^k + 1) inversely proportional to the distance between the point 
A and the location i, j being adjusted within the neighborhood. 

Figure 8.29 shows exampie leaming of a characteristic surface without 
and with neighborhood generalization. Figure 8.29(a) depicts pointwise leaming 
in several isolated locations of the controlleres memory. Figure 8.29(b) depicts 
untapered neighborhood leaming with uniform generalization built into the char- 
acteristic surface. It can be seen that the neighborhood weights are adjusted in 
this case by the same amount as for pointwise leaming. 

During the leaming phase, the control signal coming from the classical con- 
troller of fixed high-gain is added in node 52 to the feedforward neurocontroller 
output. This output gradually increases from zero values as the weights are trained 
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Figure 8.29 Learning of the characteristic surface: (a) pointwise learning mode (without gener- 
alization) and (b) neighborhood learning mode (with generalization). [©MIT Press; reprinted from 
Kraft and Campagna (1990b) with permission.] 


to approximate the plant and as the characteristic surface is unfolding. As leam- 
ing continues, the portion of the control signal derived from the neurocontroller 
increases. More precisely, the memorized weights of the CMAC controller are 
built up during control. As a result, the recalled control signal from the neural 
look-up memory gradually takes control of the plant. The conventional controller 
is excited by the error signal produced at the output of the summing node 51. 
Since this error signal gradually diminishes as the neurocontroller takes over the 
control function, the error signal input is reduced even further. As a result, the 
conventional controller is gradually eliminated from the control action. 

For an accurately trained neurocontroller having the block diagram shown 
in Figure 8.27, the error signal of the classical controller reduces to zero, as does 
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its output signal. An immediate consequence of the feedforward CMAC neuro- 
controller taking over the control function is the increased speed of the system ’s 
response. In the presence of noise, or variable plant parameters, however, the 
normally idle classical controller would assume partial control of the plant to 
compensate for partial uncertainties that may be present in the system but have 
not been leamed from earlier experience. 

The neurocontroller using the described concept has been designed to con- 
trol a linear discrete integrator plant described by (8.22a) with a = b = 1 
(Kraft and Campagna 1990b). The desired response y{k) of the sample plant has 
been chosen to be the response of the unity-gain discrete integrator-filter with 
unity low-frequency gain and a pole at z = 0.1. The difference equation of the 
integrator-filter, which produces the desired sequence y{k) at the output of the 
plant, is 

y(k + 1) - 0Ay(k) - 0.9 r(k) = 0 (8.24) 

A unity magnitude, bipolar square wave of frequency 1 Hz has been chosen 
as the reference wave r(t). Samples r(k) of the continuous wave r(t) have been 
assumed to be taken with frequency 9 Hz. Also, the plant was considered to 
be at rest initially. The reader can verify that this set of assumptions allows for 
computation of the entire sequence of plant output samples y(k ), for k = 0, 1, 2, 
. . . , using Equation (8.24). 

Three experiments for leaming the characteristic surface with simultaneous 
control of the plant have been performed (Kraft and Campagna 1990a, 1990b). 
The first experiment was performed with noise-free input. The plant input and 
output measurements in the second experiment were contaminated with a signif- 
icant amount of noise. The noise was white and uniformly distributed between 
-0.4 and 0.4. The third experiment involved no noise, but the plant to be leamed 
was nonlinear. The plant had the linear characteristic surface augmented by the 
nonlinear term — 0.4[y(£)] 1/3 to the form 

y(k + 1) = y(k) - 0.4|>(£)] 1/3 + x(k) (8.25) 

Each of the three networks has been trained for 625 weights, or memory lo- 
cations. The characteristic surface leamed in the first experiment for a = 0.3 
and in the generalized leaming mode is shown in Figure 8.30(a). The CMAC 
surfaces leamed in the two remaining cases exhibit considerable similarities to 
the surface shown in the figure. The leaming covers the domain — 1 < y(k) < 1 
and —1 < y(k + 1) < 1. The reader may notice that the plant output must be 
constrained to this region because of the choice of both the reference wave and 
the form of the difference equation (8.24). Specifically, the parts of the char- 
acteristic surface that are close to comers (—1,1), (1,-1) in coordinates y(£), 
y(k + 1) indicate the largest and the smallest control signal values, respectively. 
Ideally, the leamed surface should have been exactly as shown in Figure 8.26(a). 
However, the leaming of the initially unknown surface remains notably localized. 
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y(k + 1) = OAy(k) + 0.9 r(k) (filter) 
x(k) = y(k + *\) - y(k) (plant) 
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Figure 8.30 CMAC learning: (a) characteristic surface learned and (b) control signals com- 
puted for the known plant. [Part (a), ©MIT Press; adapted from Kraft and Campagna (1990b) 
with permission.] 


For the coordinates y(k), y(k +1) that occur frequently, the learning resulted in 
local peaks; some other areas of the characteristic surface have not deflected 
from the initial value of w t j = 0. 

Figure 8.30(b) tabulates 12 initial steps of CMAC learning for the exper- 
iment described, with no input noise present. Plant outputs y(Æ), y(k +1) have 
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been computed from Eq. (8.24). The data tabulated in the plant input column 
A'(&) were obtained assuming a known plant rather than an unknown plant being 
identified using the signal provided by the conventional feedback controller. The 
controlled plant in this example is described by Eq. (8.22) with a = b = 1. 

The measured signal tracking performance of the actual CMAC is plotted 
in Figure 8.31. The solid line represents the actual plant output, the dashed line 
indicates the desired output y(k). It can be seen from the graphs that the trained 
controller’s performance has proven to be at least satisfactory. For the no-noise 
case, 25 leaming cycles involving a time frame of 25 full periods of the square 
wave r(k) have been performed prior to the 30 output samples displayed in the 
figures. Leaming with noise took longer, but the controller’s final performance 
shown in Figure 8.3 l(b) was approximately the same as if it had leamed without 
noise. Especially high performance was achieved for nonlinear plant leaming. As 
shown in Figure 8.31 (c), the tracking achieved was very good and output was 
very close to the required wave. 

As can be seen, the controller under discussion is leaming in a supervised 
mode. The controller’s training involves characteristic surface buildup through 
weights gradually matching the plant inputs needed to produce the desired out- 
puts. The neurons in this network are with the identity activation function and 
therefore they do not participate in the processing. 

The CMAC, compared to conventional adaptive controllers performing the 
control task as described, has shown distinet desirable characteristics such as 
accuracy of leaming and leaming speed. In addition, it ranks favorably in terms 
of actual control speed. Upon leaming, the control signal is instantaneously 
available, since it is derived direetly from the look-up memory, which stores 
characteristic surface values. Moreover, the neurocontroller contains no restric- 
tions regarding the plant and system linearity, and it seems to perform well in a 
noisy environment. 


Concluding Remarks 

The section on neural network control applications explores several central 
issues such as Identification, control of static and dynamical plants, and example 
designs of selected low-order neurocontrollers. Control of example plants oper- 
ating in both continuous- and discrete-time modes was considered. However, 
the exposition does not unveil a complete range of problems and ready-to-use 
solutions. Rather, basic concepts, problems, and solutions are outlined along 
with some promising directions for development of neurocontrol systems. The 
main reason for such a preliminary exposition is that the potential of artificial 
neural systems as neurocontrollers still has to be researched before it can be 
fully realized (Barto 1990). The study of the technical literature available seems 
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Figure 8.31 CMAC-generated plant response (dashed lines denote desired output): (a) plant as 
in (8.22a), a = b = 1, (b) same as part (a) but noise added, and (c) noise-free nonlinear plant 
as in (8.25). [©MIT Press; reprinted from Kraft and Campagna (1990b) with permission.] 
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to indicate that much has to be done before the systematic design of leaming 
controllers can be undertaken. 

At this stage of neurocontrol development, several areas of possible ex- 
ploration are worth emphasizing. Neural networks can be used as models of 
dynamical systems with outputs dependent on input history. Suitable architec- 
tures and more systematic modeling approaches need to be devised for this 
purpose. Neural networks can also serve as controllers, and they seem to be 
especially promising for nonlinear dynamical systems control. Acquisition and 
storage of control information seems to be well suited to the applications of 
associative memory concepts. Such neurocontrollers would take full advantage 
of associative memory systems having high capacity, high speed, and the ability 
to interpolate and extrapolate control data in real time (Barto 1990). In addition, 
theoretical and experimental investigations of the stability of neurocontrollers 
needs to be undertaken. This would include both leaming and control stability 
of such controllers. 

To achieve this aim, novel architectures and training approaches, and an ap- 
propriate mixture of conventional and neural control need to be developed. The 
choices of the traditional identifier and controller structures are usually based on 
the well-established results of linear systems theory. However, existing design 
procedures for stable nonlinear dynamical systems are valid only for system- 
by-system cases. Such procedures are currently not available for designing large 
classes of such systems. The first comprehensive theoretical and experimental 
results in the identification and control of nonlinear dynamical systems using 
neural networks have been reported only recently (Narendra and Parthasarathy 
1990, 1991). 

One particularly interesting and promising concept is that of so-called dy- 
namic back-propagation (Hecht-Nielsen 1990). Since most physical systems in- 
volve dynamics, their realistic modeling can involve a combination of linear 
dynamical systems and static multilayer neural networks. This combination can 
be considered as an extension of error back-propagation architectures for static 
systems. The study has shown that multilayer neural networks using dynamic 
error back-propagation can be very effective in performing identification and 
control tasks. The reader is encouraged to look to references for more advanced 
coverage of neurocontrol issues (Nguyen and Widrow 1990). 


8.4 

NETWORKS FOR 
ROBOT KINEMATICS 

Although neural networks applicable to the solution of robotics control prob- 
lems are, in faet, neurocontrollers, their funetion is specialized mainly to provide 
solutions to robot arm motion problems. In this section, neural network models 
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Figure 8.32 Geometry of planar manipulator with two degrees of freedom. 


are considered for solving a number of robot kinematics problems. Robot kine- 
matics involves the study of the geometry of manipulator arm motions. Since the 
performance of specific manipulator tasks is achieved through movement of the 
manipulator linkages, kinematics is of fundamental importance for robot design 
and control. Trajectory control and leaming of the robot arms motions to achieve 
the desired final position are the main kinematics tasks covered below. 


OverView of Robot Kinematics Problems 

Our focus in this section is to provide the design fundamentals for neuro- 
controllers used in robotics systems. Trajectory control of robotic manipulators 
traditionally consists of following a preprogrammed sequence of end effector 
movements. Robot control usually requires control signals applied at the joints 
of the robot while the desired trajectory, or sequence of arm end positions, is 
specified for the end effector. Figure 8.32 shows the geometry of an idealized 
planar robot manipulator with two degrees of freedom. The robot arms operate in 
a plane. To make the arm move, the desired coordinates of the end effector point 
(x,y) are fed to the robot controller so that it generates the joint angles (0 1? 0 2 ) 
for the motors that move the arms. To perform end effector position control of 
a robotic manipulator, two problems need to be solved (Craig 1986): 

1. Inverse kinematics problem: Given the Cartesian coordinates of the end 
effector, specified either as a single point or as a set of points on a 
trajectory, joint angles or a set of joint angles need to be found. 

2. Target position control: Given the final end effector position, a joint angles 
sequence suitable for achieving the final position needs to be found. 
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The kinematics considerations for the manipulator shown in Figure 8.32 are 
based on the forward kinematic equation. The forward kinematic equation in- 
volves mapping of joint angle coordinates (0 2 ,0 2 ) to the end effector position 
(x,y). The mapping expressions can be obtained by inspection of the figure as 
follows 


X = /jCOS Øj + / 2 cos (Øj + 0 2 ) 
y = ^sinØ! + / 2 sin(0 1 + 0 2 ) 


(8.26) 


where 6 X and 0 2 are the joint angles of the first and second arm segments, 
respectively, and l u and l 2 are respective arm segment lengths. Relation (8.26) 
expresses the forward kinematic problem and implements unique mapping from 
the joint angle Space to the Cartesian space. 

The inverse kinematic problem is described by (Craig 1986): 


0 2 = cos 


-i 


'(S + y 


‘ 2 ~ / 2 ~ l 2 
l 2 


Ihh) 


(8.27a) 


6 X — tan 1 (y/x) - tan 1 [/ 2 sin0 2 / (l x + / 2 cos0 2 )] (8.27b) 


Since cos" 1 is not a single-valued function in the range of angles of interest, 
two possible orientations typically result from (8.27) for the robot arm joint 
angles. The arm can be positioned with the elbow up or down, with the end 
effector still at the required (x,y) point. The inverse kinematic transformation 
(8.27) implementing mapping from Cartesian space to joint space is thus not 
unique. 

As can be seen from Equation (8.27), control of robot arms requires the 
solution of the inverse kinematic equation system in real time. In the general 
case of an end effector with n degrees of freedom, the forward and inverse 
kinematic problems can be formulated, respectively, as 

x = h(0) (8.28) 

6 = h -1 (x) (8.29) 


where 0 and x are joint angles and end effector Cartesian coordinate vectors, 
respectively, which are defined as follows: 

e=[*i e 2 ... o k ]‘ 

X=[*l *2 ••• X„\ 

Even though the numerical solution of (8.29) can be found, this requires large, 
real-time computational resources. The neural network approach may be used 
here to solve both the problem of kinematic task formulation and the setting up 
of the equations. It also allows for circumvention of the computational complexity 
involved in their numerical solution. 
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Solution of the Forward and Inverse 
Kinematics Problems 

The forward and inverse kinematics problems can be solved by multilayer 
feedforward networks as shown in the form of a block diagram in Figure 8.33. 
The network in Figure 8.33(a) needs to be trained to leam the set of end effector 
positions x from the given set of joint angle examples. The network in Figure 
8.33(b) is trained to leam the set of joint angles 0 from the given set of end 
effector position examples. Each of the neural networks involved performs static 
mapping of two two-variable functions as discussed in earlier chapters. 

The supervised leaming diagram of both configurations from Figure 8.33 is 
shown in Figure 8.34. The reader may notice that this configuration fails into 
the general category of forward plant identification first introduced in Figure 
8.17(a). The robot manipulator output d provides the teacher’s signal, or the 
desired output value, which is compared with the network actual output o. Based 
on the error value ||d — o||, the neural network weights are tuned to achieve the 
required accuracy of mapping. Note that the teacher’s signal d for the training 
set can be provided either by solving (8.28) or (8.29), or by taking the respective 
measurements of the Cartesian or joint coordinates of the actual manipulator. The 
leaming of the manipulator proceeds off-line. Once the network has been trained 
and installed in the system, no more training is allowed since the neurocontroller 
is placed in front of the robot manipulator as discussed in the following para- 
graph. Therefore, the set of training data should be uniformly distributed over 
the robot’s working area so that the network can make good generalizations for 
the intermediate points. 

Obviously, once the network has been trained to perform the required kine- 
matic transformation, it then needs to be employed in the inverse feedforward 
control mode. If the neural network has been trained to solve a forward kine- 
matics problem, its natural application is to control the manipulator, which needs 
Cartesian coordinates x as input and produces joint angles 0 as outputs. This 
is shown in Figure 8.35(a). The input to the neural controller is the vector of 
desired joint angles, Øj. After error-free training of the controller, we should have 
0 = Øj at the output of the robot manipulator. Using the trained network shown 
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Figure 8.33 Neural networks for robot kinematics transformation: (a) forward kinematics 
problem and (b) inverse kinematics problem. 
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in Figure 8.35(b) requires specification of the desired end effector coordinates x d . 
The network translates these coordinates to angles 0 and the robot manipulator 
should respond with x = x d . 

Let us review the forward kinematics trajectory leaming for the two degrees 
of freedom (planar) manipulator. The concept of “trajectory” is understood here 
as the end effector workspace rather than as a sequence of points to be followed. 
In designing trajectory we are therefore dealing with specializing the robot end 
effector movement in a specific work area rather than training it to reach sequence 
of points ending at a target position. Target position leaming is studied separately 
later in this chapter. 

In this example the robot is required to leam Cartesian coordinates x,y for 
the circular trajectory of radius r. A network with two inputs (Øj and 0 2 ), four 
hidden nodes, and two output neurons yielding outputs x,y has been selected 



Figure 8.35 Controlling robot manipulator using neural controller: (a) joint angles input and 
(b) target coordinates (end effector) input. 
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Figure 8.36 Forward kinematics solution for circular trajectory, r = 3: (a) response of the 
trained network and (b) mapping error for training data reused in the test mode. [©IEEE; 
adapted from Arteaga-Bravo (1990) with permission.] 


to simulate the expression (8.26) (Arteaga-Bravo 1990). The constraint on the 
circular trajectory, which needs to be leamed, is 


x = r cosjS 
y — r sin /3 


(8.30) 


where angle /3 is covering the first quadrant in this example. The training has 
been performed for = 3, l 2 = 2, and r — 3 at 10 points of the selected 
trajectory. The 2-4-2 network has been trained using the error back-propagation 
technique with leaming constant 17 = 0.9, and a = 0.9 for the momentum term. 
The training data of the trajectory with r = 3 contains angle j 8 between 5 and 
85°, spaced by 10°, and /3 = 90°. 

The trained 2-4-2 network has subsequently been evaluated using the train- 
ing data in the test mode. The results of training are shown in Figure 8.36(a). The 
figure illustrates an accurate mapp ing of a ngles into the desired end effector 
positions forming the trajectory \Jx 2 + y 2 = 3. The differences between the 
accurate forward kinematics solutions and those computed by the neural network 
are illustrated in Figure 8.36(b). The maximum magnitude difference, or max- 
imum absolute error, has been observed to be about 0.08 for the x coordinate 
at = 90°. 

A multilayer feedforward network with two hidden layers containing four 
neurons each has then been employed to perform the manipulator ’s inverse kine- 
matic transformation for the same circular trajectory. The 2-4-4-2 network has 
been selected for training with parameters rj = 0.9 and a — 0.3. The results of 
the evaluation of the trained network for the training data 0 X and 0 2 are illustrated 
in Figure 8.37(a). The figure illustrates a mapping of points from the trajectory 
(8.30) with r = 3 into joint angles 6 l and 0 2 * The differences between the accurate 
inverse kinematics solutions and those computed by the network at the training 
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Figure 8.37 Inverse kinematlcs solution for circular trajectory, r = 3: (a) response of the 
trained network and (b) mapping error for training data in the test mode. [©IEEE; adapted from 
Arteaga-Bravo (1990) with permission.] 


points are depicted in Figure 8.37(b). The largest absolute error of the joint angle 
has been approximately 0.33° for d 2 at j8 = 55°. 

Neurons with unipolar activation functions have been used in this experi- 
ment. Training input data for both cases have been mapped to the range (0,1) 
in order to make more efficient use of the unipolar activation function and to 
obtain better results in training the networks. 


Comparison of Architectures for the 
Forward Kinematics Problem 

In this section, various feedforward neural networks are considered for solv- 
ing the forward kinematics problem for robots with two degrees of freedom. 
Several network configurations have been designed and compared for modeling 
of the manipulator kinematics (Nguyen, Patel, and Khorasani 1990). A manip- 
ulator is treated in the description below as a black box for which the training 
samples are collected from geometrical measurements. The specific trajectory of 
the end effector is used to test the outcome of the training for selected neural 
architectures. 

Each of the networks used for evaluation has two input nodes, 0 1 and 0 2 , in 
addition to the bias node, and two output nodes, x and y. Since the unnormalized 
output values are Cartesian coordinates and exceed the range (—1,1) covered by 
the bipolar sigmoidal-type activation functions, output neurons are selected such 
that they have an identity activation function and perform scalar product compu- 
tation only. Architectures studied have been: 2-10-10-2 for the network trained 
with the error back-propagation algorithm (BP), 2-10-10-2 for the network 
trained with the back-propagation algorithm but with output splitting (BPOS), 
2-8-2 for the functional link network (FL). In addition, a counterpropagation 
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(a) 

Figure 8.38a Architectures for a learning manipulators forward kinematics problem: (a) BP 
[©IEEE; adapted from Nguyen, Patel, and Khorasani (1990) with permission.] 


network (CP) has been used to provide trajectory mapping (Nguyen, Patel, and 
Khorasani 1990). 

The networks have been trained using a set consisting of 64 input /output 
pairs obtained from measurements taken in the robot working area. Weights have 
been initialized at random values between -0.5 and 0.5, except for output layer 
weights, which have been initialized at zero. The joint angle values for training 
have been selected in the first quadrant in such a way that they uniformly cover 
the entire work area of interest defined by 0 < 6 X < 90° and 0 < 0 2 < 90°. 

BP and BPOS error back-propagation architectures have been trained with 
7 ] = 0.1 and r] = 0.4 for the first and second hidden layers, respectively. The 
architecture of the BPOS network is the modified version of the common BP 
network. The nodes in both hidden layers of the BPOS network have been 
divided symmetrically into two groups. Both networks are shown in Figure 8.38. 
The FL network is composed of eight input nodes providing the following signals: 
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Figure 8.38b Architectures for a learning manipulator’s forward kinematics problem (continued): 
(b) BPOS. [©IEEE; adapted from Nguyen, Patel, and Khorasani (1990) with permission.] 


Øj, 0 2 , sin Øj, sin 0 2 , cos Øj, cos0 2 , sin (Øj + 0 2 ), and cos (Øj + 0 2 ). A simple delta 
rule was used for learning of the single-layer feedforward FL network with a 
learning rate of tj = 0.4. 

The CP network design has followed the guidelines of Chapter 7 with the 
Kohonen layer used as a first CP network layer. This part of the network layer 
has been trained using the winner-take-all learning rule. The layer is followed 
by the Grossberg layer producing the desired network output. The input to the 
first layer of the CP network is provided by normalized input vectors. The CP 
network has performed here as a look-up table responding with the output, which 
is paired with the stored input that matches the output by being in the same 
training pair of vectors. 

The following trajectory was used to train and test the discussed four archi- 
tectures 


Øl (0 = 3 
02 (0 = 3 


10 

l0~ 3 r 


(8.31) 
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Figure 8.39 Trajectory for nelwork testing. [©IEEE; adapted from Nguyen, Patel, and 
Khorasani (1990) with permission.] 


where O < t < 500 s. Figure 8.39 illustrates the trajectory used for network 
leaming in Cartesian coordinates. The FKE equation (8.26) provides here the 
results for the Cartesian coordinates of the end effector. 

The training of all four network architectures resulted in quick leaming 
convergence to the desired trajectory. The basic BP network with two hidden 
layers gave a reasonably accurate solution. The error defined as distance (in 
centimeters) between the trajectory (8.31) and the one produced by the trained 
network is shown in Figure 8.40(a). The training of the BPOS network, which 
involved considerably fewer connections than the BP network, yielded compara- 
ble accuracy. This is shown in Figure 8.40(b). Although BPOS network training 
required 300 training cycles compared to 200 cycles for the same error of 10 
percent for the BP network, its total training time has been shorter due to the 
smaller number of weights in the BPOS network. 

The best solution of the trajectory leaming task has been achieved with 
the FL network. The network required only 20 training cycles and the training 
resulted in the best approximation of the target trajectory (8.31) for all four cases 
studied. This can be seen from Figure 8.40(c). 

In contrast with the multilayer feedforward networks and functional link 
network, the CP network implements a steplike approximation of the trajectory. 




60.000 

Euclidian error distance (cm) 

0000 — , 




Figure 8.40 Trajectory error for the test mode: (a) BP, (b) BPOS, (c) FL, and (d) CP. [©IEEE; 
adapted from Nguyen, Patel, and Khorasani (1990) with permission.] 
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The training results in relatively fast approximation of the trajectory, but the end 
effector position error is rather large and uneven, as can be seen from Figure 
8.40(d). Therefore, the CP network in this application can be recommended only 
if a short development time is required for low-precision trajectory tracking. 

The discussion of neural network applications for solution of the forward and 
inverse kinematics problems and generation of the desired trajectories indicates 
that neural networks can efficiently leam manipulator kinematics. Modeling of 
the desired trajectories or kinematic transformations can be achieved with a con- 
siderable degree of accuracy using a number of network architectures. Training 
of a suitable neural network architecture can be performed based on geometrical 
measurements. Altematively, training can be based on computed solutions of ei- 
ther forward or inverse kinematics problem for the robot working area of interest 
to the designer. 


Target Position Learning 

So far our discussion of applications of neural networks for robotics has 
only focused on memoryless mapping of manipulator kinematics. Neural net- 
works, however, can also be employed to leam a sequence of mappings. This 
is understood as learning a sequence of Space transformations. In this sense, the 
learning of a target position can be understood as dynamical control learning. 
Robotic arms can be treated as dynamical plant. In addition to learning the trajec- 
tory or working area, which is essentially static, or point-to-point transformation, 
robot arms can be controlled to follow a sequence of steps and to reach a desired 
target position. In this section the study of robot arm movement with two degrees 
of freedom and with a desired final end effector position is discussed based on 
the work of Sobajic, Lu, and Pao (1988). In particular, the presence of trajectory 
disturbances is evaluated in this study. 

Our objective is to move the end effector of a two-link planar robot manip- 
ulator toward a target point until the positions of the two coincide. The specific 
task for the neural network designer is to generate the sequence of joint vector 
increments, thus providing the appropriate sequence of incremental control sig- 
nals. In this discussion we assume that the target end effector position that needs 
to be leamed is specified in polar coordinates r t /6 t . An error back-propagation 
network with architecture 4-6-4-2 has been applied to perform the task (Sobajic, 
Lu, and Pao 1988). Network input signals are equal to the currently measured 
joint angles 6 x {t) and 0 2 (0, and also to the fixed target coordinates r t and 6 t , The 
network needs to map angles 0 X and 0 2 into the two control actions k { A6 x and 
£ 2 A0 2 , respectively. The control State equations have the following form: 

øj(t + 1 ) = e x (f) + k 1 (t)A6 l 
Qiit + 1) = Ø 2 W *2«A0 2 


(8.32) 
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IBM PC/AT INTELLEDEX 


Figure 8.41 The communication scheme between the Inteliedex 605T and the neurocontroller. 
[©IEEE; reprinted from Sobajic, Lu, and Pao (1988) with permission.] 


The actual outputs of the neural network are k x (t) and k 2 {t). The values are limited 
to the range (—1,1) since bipolar activation functions are used. An Inteliedex 
605T-type robot and the neural network controller for that experiment have been 
configured in a closed-loop feedback system (Sobajic, Lu, and Pao 1988). As 
shown in Figure 8.41, the information exchange between the neurocontroller and 
the robot is carried by the RS232 serial port. 

Two experiments have been performed. In the first experiment, the neural 
network controller has been trained to guide the arm end effector from any initial 
location to the specified single fixed target position. The training has been imple- 
mented based on the set of 26 input vectors covering the anticipated work area 
and involving a number of different initial points. Network training results are 
tabulated in Figure 8.42(a). Eight different initial point positions have been used 
for training as shown in the leaming control experiment. The maximum relative 
target distance error reported in the experiment is 1.46 percent. All remaining 
relative target distance errors were below 1 percent. 

The second experiment has been designed to move the robot arm from any 
initial location to one of a number of specified target positions. The training 
has been performed using the set of 64 training input vectors for each of the 14 
specified target points being entrained. Results of the training have been tabulated 
in Figure 8.42(b) in 14 rows, with each row corresponding to one target point. 
All trials originated at 6 X = 0 2 — 0. The maximum relative target distance error 
reported was 2.36 percent. The final weight values of the trained network of size 
4-6-4-2 are listed in Figure 8.43. 

In addition to testing the trained network under normal working circum- 
stances as reported above, the controller was also investigated when affected by 
the presence of unforeseen disturbances. First, the network trajectory control was 
overridden first at time t = t u and then at time t = t 2 . Thus, the robot arm was 
deflected twice to undesirable positions off the trajectory. In each case, the neural 
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(fy, fy)— initial position of the robot arm. 

(r ( , fy)— target position. 




(fy fy)— final position of the robot arm. error— the relative error. 

(b) 

Figure 8.42 Result of the 4-6-4-2 network training: (a) fixed target position, different starting 
points, and (b) different target positions, different starting points. [©IEEE; reprinted from Sobajic, 
Lu, and Pao (1988) with permission.] 


controller has shown remarkable response and it produced a new and correct 
trajectory, as illustrated in Figure 8.44(a). In another experiment, the target posi- 
tion was altered at t = f, while the robot arm was already following the desired 
trajectory. As a result of the sudden alteration of the target, the neural controller 
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Figure 8.43 Tabuiated weights of the trained network for the different target points experiment. 
[©IEEE; reprinted from Sobajic, Lu, and Pao (1988) with permission.] 


generated the revised set of command actions, and the new target position was 
reached. This is illustrated in Figure 8.44(b). 

The described project with target position guidance control has made use 
of multilayer feedforward controllers trained using the error back-propagation 
techniques. The results achieved demonstrate their unusual efficiency and po- 
tential for use in robot kinematics control systems. The performance of neural 
controllers described in this section can be based entirely on actual geometrical 
conditions measured in the robot working space. Therefore, robot neurocontrollers 
can be developed without the need for complex mathematical transformations or 
even mathematical modeling of the problem. The discussed networks have been 
able to leam and generalize not only different static space-to-space mappings, but 
also the robot movement dynamics. The networks were also able to memorize the 
sequence of control signals needed to achieve one of the specified final positions. 
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Figure 8.44 The trajectory of the robot arm in the presence of disturbances: (a) the arm po- 
sition disturbance and (b) the target coordinates alteration. [©/£££; reprinted from Sobajic, Lu, 
and Pao (1988) with permission.] 


8.5 

CONNECTIONIST 
EXPERT SYSTEMS FOR 
MEDICAL DIAGNOSIS 

This section presents examples of connectionist expert systems for medical 
diagnosis. The introductory concepts of multilayer feedforward networks with 
error back-propagation training used for medical diagnosis have already been 
formulated in Section 4.6 and illustrated in Figure 4.24. This section discusses in 
more detail the specific capabilities and limitations of medical diagnosis connec- 
tionist expert systems. Example expert systems for diagnosis of skin, low back 
pain, and heart infarction diseases are presented. 

Medical diagnosis is performed with multiple, often related, disease States 
that can occur with identical symptoms and clinical data. In addition, specific 
diseases are not always accompanied by the same symptoms and clinical and 
historical data. As a result of these difficulties, both physician accuracy and 
computer-aided diagnosis systems leave room for improvement. This is one of 
the reasons for our study. Let us review the basic issues of computer-aided 
diagnosis. 

Computer-aided diagnosis systems have been successfully used in a number 
of areas of medicine. Among several approaches to automated diagnosis, rule- 
based systems and knowledge -based systems using the Bayesian approach seem 
to be the most widespread. Both approaches, however, seem to suffer from some 
problems in medical diagnosis applications. Rule-based systems typically require 
formulation of explicit diagnostic rules. Much of the human knowledge, and 
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especially expert or medical expert knowledge, however, remain implicit. Implicit 
knowledge can be expressed neither as a mathematical function for numerical 
computation nor as an if-then rule for symbolic processing. Even expressing 
the implicit knowledge as a conditional probability density function encounters 
difficulties because of statistical interdependencies between the events. 

Generally, implicit knowledge is difficult to quantize, formalize, or some- 
times even express verbally. In addition to these difficulties, translation of implicit 
knowledge into explicit rules would thus lead to loss and distortion of information 
content. In other words, domains with implicit knowledge are difficult to express 
in terms of accurate specifications of decision rules. In addition, the knowledge 
engineer would have to understand thoroughly the domain and the expert rea- 
soning in order to extract the rules for the system. Building an expert system for 
diagnosis using decision rules remains an expensive and time-consuming task. 

The approach based on Bayes theorem involves comparisons of relative 
likelihoods of different diagnoses. The individual likelihoods of diseases are 
computed from individual patient information. Symptoms of diseases have to be 
assumed to be statistically independent in the Bayesian approach. This assump- 
tion underlying the Bayes theorem, however, does not apply widely in diagnostic 
practice. This is because several symptoms often arise due to the same organic 
cause. On the other hånd, such symptoms have to be kept separated because they 
may provide vital decision information. Thus, a useful performance by practical 
diagnosis systems that are based on Bayes theorem is not always guaranteed. 

The connectionist expert system can be trained to associate diagnoses with 
the symptom data. This presents an alternative approach that is essentially data 
driven. The connectionist approach lies in the middle ground between the rule- 
based and Bayesian approach. The experimental and heuristic approach prevailing 
in building connectionist expert systems focuses on repeated presentation of train- 
ing examples of the various symptoms along with the diagnosed diseases. After 
training of a suitable network architecture is complete, test data and generalization 
by the network should lead to correct disease diagnosis. 


Expert System for Skin Diseases Diagnosis 

The expert system called DESKNET has been designed for instruction of 
medical students in the diagnosis of papulosquamous skin diseases. These are 
diseases that exhibit bumpiness or scaliness of the skin (Yoon et al. 1989). The 
system has been developed and used mostly for symptom gathering and for 
developing diagnosing skiils rather than for replacing the doctor’s diagnosis. The 
expert system uses multilayer feedforward network and employs the 96-20“ 10 
architecture with a single hidden layer and continuous unipolar neurons. 

Input for the system consists of the following skin disease symptoms and 
their parameters: location, distribution, shape, arrangement, pattem, number of 
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lesions, presence of an active border, amount of scale, elevation of papuls, color, 
altered pigmentation, itching, pustules, lymphadenopathy, palmar thickening, re- 
sults of microscopic examination, the presence of herald patch, and the result of 
the dermatology test called KOH. In addition to the current symptom-generated 
data from the above list, the duration of skin lesions in days and weeks are also 
represented at the input. 

The system has a total of 96 inputs plus the fixed bias input, and 10 out- 
put neurons. Inputs of values 0 and 1 are fed to the network signifying the 
absence or presence of the symptoms and their respective durations. The output 
neurons used in the local representation mode are indicative of the following 
10 diseases diagnosed: psoriasis, pityriasis rubra pilaris, lichen pianus, pityriasis 
rosea, tinea versicolor, dermatophytosis, cutaneous T-cell lymphoma, secondary 
syphilis, chronic contact dermatitis, and seborrheic dermatitis. 

The training data for the DESKNET system consisted of input specifications 
of 10 model diseases collected from 250 patients. If specific symptoms or their 
parameters were not known, the input was coded as 0.5. In this way the outputs 
of unipolar neurons used in this network were made independent of that input. 
The network was trained using the standard error back-propagation algorithm. 

The performance of the developed network was tested for previously unused 
symptom and disease data collected from 99 patients. The correct diagnosis was 
achieved for 70% of the papulosquamous group skin diseases. The success rate 
was above 80% for the remaining diseases except for psoriasis. Psoriasis patients 
were diagnosed correctly only in 30% of the cases. According to the domain 
expert, however, psoriasis often resembles other diseases within the papulosqua- 
mous group, which makes it somewhat difficult to recognize even for specialists. 

In case of a rule-based system whose knowledge base consists of a set of 
explicit rules, the decision path can be traced and explanation of the decision can 
be provided to the user. Connectionist expert systems reach conclusions through 
rather complex, nonlinear and simultaneous synergistic interaction of many units 
at a time. The accurate justification of an hypothesis, therefore, involves all nodes 
of the network. If the justification of a conclusion is expected from a discussed 
system, analyzing the effect of a single input, or selected group of inputs, would 
be very difficult and would yield inaccurate results. 

However, some explanation capability exists within a trained connectionist 
expert system. Despite its limitations, this capability can offer certain interpreta- 
tions to the user, and also reveals those input variables that are more important 
and are likely to contribute to decisions. For the local representation network, one 
of the output nodes is activated stronger than any of the remaining ones. Weights 
leading to this node with relatively large magnitudes are therefore contributing 
more to the decisions than the remaining weights, each considered separately. 
The positive and negative signs of relatively large weights can be looked at 
as indicative of positive and negative contributions, respectively. An additional 
consideration is that the hidden layer re-represents the input data and the outputs 
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of hidden nodes are neither symptoms nor diagnostic decisions. In a network 
with a single hidden layer, intemal data representation provided by the hidden 
layer rather than the original input data is used for generating the final decision. 
Intemal representation can, however, offer further heuristic interpretations of the 
knowledge processing within the system. 

In the case of the DESKNET expert system, which uses 20 hidden-layer 
nodes, the total of 20 hidden-layer factors attributed to the level of hidden node 
responses possess the discrimination capability for the set of 10 diseases. The 
factors are transmitted by large weights connected to the strongest activated 
output neuron of the output layer. With reference to Figure 4.7, such weights can 
be labeled w* i;i , . . . , , where k x is the output node number indicating 

the disease diagnosed, and there are / intemal factors supporting it. 

An example of the explanation potential of DESKNET is shown in Figure 
8.45. The figure shows only the weights that are of relative importance for 
diagnosing a dermatophytosis disease. It can be seen that, although unnamed, 
intemal factors numbered 13 and 14 are the two strongest in causing disease 
number 5. Thus, for this example case, we have k x = 5, j x = 13, j 2 — 14, 
and / = 2. Relatively large input weights now need to be found for each of the 
hidden nodes j\ 9 j 2 , . j/. Input nodes 1, 10, and 36 and 6, 10, 36, and 71 were 
identified as those that are affecting the intemal factors of 13 and 14, respectively. 
Consequently, symptoms and their parameters numbered 1, 6, 10, 36, and 71 were 
found to be of dominant importance for the diagnosis of dermatophytosis. 

Below we explain the methodology that has been used for Identification 
of relatively large and contributive weights and symptoms in DESKNET. To 
determine intemal decision factors, the largest and smallest weights of the output 
layer have been selected among all weights j = 1, 2, ...,/, for a given 
diagnosis k v The maximum difference between the largest and smallest weight 
constitutes the weight range. The range is then divided by 6 to estimate the 
standard deviation of weights. The doubled standard deviation is used as the 
cutoff point for discrimination of relatively large weights. For the data of Figure 
8.45 with diagnosed disease 5, or k x — 5, weights with the largest magnitudes 
are w 5 13 = 2.86, w 5 U = 2.71, w 52 = -2.68, w 56 = —3.46, w 5 10 = -2.38, 
and w 5 17 = -3.31. The range for this set of weights is 

w 5,13 “ w 5,6 = 6.32 

This yields the standard deviation of approximately 1.05, and the cutoff points 
equal to ±2(1.05) = ±2.1 for selection of relatively large weights. Therefore, 
only two weights, w 5 X3 and w 5 14 , are found to support the decision of node 
k x = 5. The weights are shown in the figure and other weights are omitted. This 
analysis performed for the output layer needs to be repeated by using hidden 
nodes 13 and 14 as decision nodes. The full probability profile obtained for the 
diagnosis path of dermatophytosis is illustrated in the figure. 

The discussion indicates the presence of the following symptoms and their 
parameters for this particular disease: lesions of weeks of duration, minimal 
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Figure 8.45 Explanation of dermatophytosis diagnosis using the DESKNET expert system. 
[Adapted from Journal of Neural NetWork Computing, New York: Auerbach Publishers, © Warren, 
Gorham & Lamont, Inc., used with permission, from Yoon et al. (1989).] 


itching, positive KOH test, lesions on feet, minimal increase in pigmentation, and 
microscopic evaluation for pseudohyphae. The profile produced by this methodol- 
ogy rather closely matches the probability profile provided by the domain expert. 
Similar probability profiles can be obtained for the negative responses of the bid- 
den nodes numbered 2, 6, 10, and 17 (not shown on Figure 8.45). These profiles 
indicate that absence of certain symptoms contribute to the positive diagnosis of 
a particular disease. 

It should be noted that the discussed method of explanation extraction from 
the expert system is rather heuristic and somewhat inconclusive. Indeed, factors 
influencing the decision are usually distributed within the network in a complex. 
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superimposed way. Thus, only fairly general and somewhat tentative conclusions 
can be drawn from the search for the largest and smallest weights connected to 
the neurons with the strongest responses within the layer. 


Expert System for 
Low Back Pain Diagnosis 

The neural network-based expert system described in this section is used 
for low back pain disease diagnosis (Bounds et al. 1988). This expert system 
is somewhat better validated in practice and less experimental than DESKNET. 
After training, the system is able to provide correct diagnosis more often than 
any of the three groups of doctors or the fuzzy logic system when they are 
used for comparisons on the same data. This project offers an interesting and 
suggestive example of the useful performance of a neural network in an actual 
diagnosis application. 

Back pain is one of the most common complaints encountered by doctors, 
and yet the diagnosis of the cause underlying the back pain is difficult. Common 
symptoms are often found in patients with serious spinal problems and in those 
with much less serious problems. In addition, nonorganic symptoms make the 
diagnosis of this disease more difficult. 

For the purpose of patient treatment the back pain needs to be classified 
into four categories as follows (Bounds et al. 1988): 

■ Simple low back pain (SLBP) — simple backache 

■ Root pain (ROOTP) — nerve root compression due, for example, to disk 
prolapse 

■ Spinal pathology (SPATH) — due to tumor, inflammation, or infection 

■ Abnormal illness behavior (AIB) — back pain with significant psycho- 
logical overlay. 

Patients with AIB need to be separated from the population due to the difficulties 
of interpretation of clinical tindings in patients of this group. Further compre- 
hensive assessment is needed for such patients before a more detailed diagnosis 
is provided. 

The set of patient data used referred to 200 patients with back pain. Lists 
of symptoms and medical history data were compiled for each patient for the 
purpose of producing the training and test sets. To ensure that the patients’ prob- 
lems were truly representative of one of the four classes, patients were followed 
up over a long period. Fifty examples of each class were used; 25 formed the 
training set and the remaining formed the test set for each class. 

The input data were coded into 50 inputs in the unipolar values range [0,1]. 
Many of the symptoms and medical history questions were responded to with 
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“yes/no” answers and they were assigned binary 1/0 input values, respectively. 
The input was coded 0.5 where none of the answers was applicable. Answers to 
a few questions that were not binary were coded with the input values between 
0 and 1. Unipolar continuous neurons arranged first in a single- and then in 
a double-hidden layer architecture were used for experiments. Both local and 
distributed representations were tested. Although local representation with a sep- 
arate output neuron for each class worked reasonably well, better results were 
obtained for distributed coding as folio ws: 


1 [0 0] f forSLBP 
[0 1]' for AIB 

[1 0]' for ROOTP 

[1 l]' for SPATH 


This coding improved the chances of a less serious cause (SLBP) being 
incorrectly attributed as the most serious cause (SPATH), and vice versa. 

The training was implemented for the leaming rate of 0.1 and with the mo- 
mentum term of 0.9. Somewhere between 10 and 40 training runs were carried 
out, each starting from different initial random weights for each architectural 
configuration tested. A different number of hidden neurons were tried, ranging 
from 0 to 50 in a single hidden layer. Also, various sizes of double hidden lay- 
ers were evaluated. Subsequent to training, feasible network architectures were 
selected. Final tests were made on the test set with a thresholding value of 0.5 
for the output neurons so that they provided binary decisions. All units of the 
trained networks were fully connected. 

The results were found to be somewhat insensitive for the number of single 
hidden layer units if it was selected between 0 and 50. The average success rate 
for test runs is depicted in Figure 8.46. Ten runs for each tested architecture are 
included in the average shown. Markedly, the large dimensionality of the input 
vector caused only a marginally worse decision rate for a single-layer network 
(/ = 0) compared to a single hidden-layer network containing dozens of hidden 
neurons. Networks with double hidden layer have shown no practical advantage 
over the results illustrated in the figure for a single hidden layer. This conclusion, 
however, remains valid only for this particular problem, network input /output 
size, and the data used to tr^in it. 

To evaluate the performance of the developed expert system, the set of test 
data on symptoms and medical history of patients was given to both general 
practitioners and specialists. In addition, the set was used for testing a CAD 
system based on fuzzy logic (Bounds et al. 1988; Norris 1986). Comparison of 
performance has been made for the following diagnosis methods: 


■ BNS — Bristol neurosurgeons 

■ GOS — Glasgow orthopaedic surgeons 


534 


APPLICATIONS OF NEURAL ALGORITHMS AND SYSTEMS 


Percentage ^ 

test pattems 
correct 84 - 

k 

82- 


80- 

• • 

• 

78- 

• 

• • * 

• 

• 

76- 

i 

74- 



~™ 1 1 1 1 1 1 1 1 1 1 ^ 

5 10 15 20 25 30 35 40 45 50 

Number of hidden units 

Figure 8.46 Test set performance of the back pain diagnosis expert system using a single 
hidden layer network with J hidden nodes. [©IEEE; adapted from Bounds et al. (1988) with 


permission.] 


■ BGP — Bristol general practitioners 

■ FLS — Fuzzy logic system (CAD) 

■ MLP1 — Best multilayer perceptron with architecture 50-30-2 

■ MLP2 — Mean of 10 MLP runs for 50-30-2 network 

■ MLP3 — Best multilayer perceptron with architecture 50-0-2 

■ MLP4 — Mean of 10 MLP runs for 50-0-2 network. 

The results for each of the four diagnoses are summarized as bar charts in 
Figure 8.47. The results indicate that the overall performance of the multilayer 
perceptron network (black bars) exceeds that of three groups of doctors (blank 
bars). Also, the multilayer perceptron network reaches the level of performance 
of the fuzzy logic system (shaded bar). For the most serious disease, spinal 
pathology, the neural expert system outperforms the other methods as shown in 
Figure 8.47(d). It can be seen that any of the four selected architectures yields 
an equal or better diagnosis than each of the three teams of doctors and the 
fuzzy logic system. This outcome of the diagnosis is of special importance since 
undetected spinal pathology may have serious implications for a patient’s health. 
On the other hånd, the neural expert system performed much worse than any other 
method for diagnosis of simple low-back pain as illustrated in Figure 8.47(a). 
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Figure 8.47a, b Comparison of diagnostic methods for back-pain diagnosis: (a) simple low 
back pain and (b) abnormal illness behavior. [©IEEE; reprinted from Bounds et al. (1988) with 
permission.] 


In the opinion of the authors of the project, who are pattem recognition spe- 
cialists and medical school faculty, the neural network could perform even better 
than reported provided a larger number of representative training examples were 
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Figure 8.47c, d Comparison of diagnostic methods for back-pain diagnosis (continued): (c) root 
pain, and (d) spinal pathology. [©IEEE; reprinted from Bounds et al. (1988) with permission.] 


used. Insufficient examples which do not spån the entire Spectrum of possible 
input pattems may have caused certain features of pattems not to be encoded in 
the network during the reported training experiment. 
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Expert System for Coronary 
Occlusion Diagnosis 

Acute myocardial infarction, also called coronary occlusion, is an example 
of another serious disease that has been difficult to diagnose accurately. A con- 
siderable number of methodologies have been developed in attempts to improve 
on the diagnosis accuracy of predicting acute myocardial infarction. The best of 
the diagnostic approaches using conventional computer-aided diagnostic systems 
has performed with a detection rate of 88 percent and a false alarm rate of 26 
percent. These prediction rates are comparable to those by physicians, reported 
as equal to 88 and 29 percent for detection and false alarm rates, respectively 
(Baxt 1990). 

A multilayer unipolar perceptron network trained using the error back- 
propagation technique was applied to the prediction of acute myocardial infarction 
in patients admitted to the emergency room with acute anterior chest pain. Be- 
cause most patients in this population are not suffering from acute myocardial 
infarction, a subset of patients with a much greater probability of the disease was 
chosen for the study. Moreover, only patients later admitted to the coronary care 
unit were included in the evaluation. Understandably, only for these patients, the 
full present and retrospective data were available for analysis (Baxt 1990) and 
subsequent neural network training. 

Consider that the present coronary care unit data confirm or rule out the 
presence of infarction. These data were presented as desired output and coded as 
0 or 1 for an absent or present infarction, respectively. The retrospective data were 
clinical variables collected in the emergency department and submitted as input 
pattems to the network. The full list of predictive input variables for the diagnosis 
of coronary occlusion may include as many as 41 items. The tests performed 
on the neural network indicated, however, that only 20 of these variables were 
relevant for representing essential retrospective diagnostic data. The variables 
are tabulated in Figure 8.48(a). The inputs to the network were generated by a 
specially written program that coded most of these clinical variables into binary 
data 0 and 1 for the absence or presence of a tinding. 

The multilayer feedforward perceptron network that yielded the best results 
in the project has 20 input nodes, 1 output node, and two hidden layers consisting 
of 10 neurons each. The data involved a population of 356 patients randomly 
divided into two equal and permanent groups. In each of the groups there were 
118 patients without acute myocardial infarction and 60 patients who sustained 
the disease. The network training was simulated starting at random initial weights 
with a leaming rate of 0.05 and momentum term of 0.9. 

The data on the first group of patients were first used to train the network, 
while the data on the second group were used to perform a test of the diagnostic 
expert system. It was found during tests that the output neuron indicating the 
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Figure 8.48 Connectionist expert system for coronary occlusion diagnosis: (a) input variables 

and (b) test data performance summary. [Adapted from Baxt (1990)] 



disease always exceeded 0.8, and the output indicating its absence was below 
0.2. The network correctly diagnosed 55 of 60 test patients with infarction and 
113 of 118 test patients without infarction. Secondly, the training and test sets 
were swapped and the training and test sessions repeated. The newly obtained 
network correctly diagnosed 56 of 60 test patients with infarction and 113 of 118 
test patients without infarction on the new test data. 

Figure 8.48(b) summarizes the network performance achieved in this project. 
The analysis of results reveals that the connectionist expert system has a disease 
detection ratio above 92 percent among sick patients. However, among healthy 
patients the system provides false alarms of disease in 4.3 percent of the cases. In 
spite of the errors, these results outperform the best previously reported diagnostic 
schemes, including trained physicians, which yield results of 88 and 26 percent, 
respectively. 

Although the results presented above are very encouraging, further study 
needs to be made. Specifically, to draw conclusive comparisons, the experiments 
with different diagnosing techniques would have to use the same set of clinical 
and diagnostic data. Further, the study involved only patients admitted from the 
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emergency department to the coronary care unit, which could have had conse- 
quences in terms of disease development. Despite the preliminary scope of the 
study, if the reported results hold up to further scrutiny, the connectionist expert 
system for infarction diagnosis could have a substantial impact on the reduction 
of health care costs. 


Concluding Remarks 

The three medical diagnosis expert systems presented in this section are 
able to store the model patient knowledge base. Moreover, the connectionist 
methodology to store knowledge can substantially ease the task of budding an 
expert system. It is no longer necessary for a knowledge engineer to have a 
thorough understanding of implicit rules, links, and relations within the domain. 
Moreover, the amount of time for system development is substantially reduced 
compared to the systems that use either a tree structure of rule-based relationships 
or statistical calculations. 

Although the performance achieved by the example expert systems discussed 
is impressive, the systems built and tested should be considered to be prelimi- 
nary and tentative (Jones 1991). Although they compare rather advantageously 
and sometimes are even clearly superior to human experts, more detailed and 
exhaustive study is needed. Such a study would allow better assessment of all 
capabilities and limitations of connectionist methodology for expert systems ap- 
plications. Also, a more comprehensive and versatile approach to the retrieval of 
the decision explanations of such expert systems is needed. It is almost certain 
that these and other relevant questions shall be explored by knowledge scientists 
in the near future. 


8.6 — — 

SELF-ORGANIZING 
SEMANTIC MAPS 

The semantic maps discussed in this section are implemented based on 
the self-organizing feature map concept described in Sections 7.4 and 7.5. This 
application example demonstrates that, in addition to processing quantitative data 
for feature extraction, the maps make it possible to display abstract data, such 
as words, anranged in topological neighborhoods. The maps extract semantic 
relationships that exist within the set of language data considered here as a 
collection of words. The relationships can be reflected by their relative distances 
on the map containing words positioned according to their meaning or context. 
This indicates that the trained network can possibly detect the logical similarity 
between words from the statistics of the contexts in which they are used. The 
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context is understood here as an element of a set of attribute values that occur in 
conjunction with the words. In another approach, the context can be determined 
by the frequency of neighboring words without regard to the attribute values 
that are occurring. The discussion below is based on experiments and results 
described by Ritter and Kohonen (1989) and Kohonen (1990). 

The self-organizing feature network consists of a number of laterally inter- 
acting adaptive neurons, usually arranged as a two-dimensional array. Although 
each array neuron is connected to each input node, only neighborhood neurons re- 
spond, after training, to specific inputs. The planar position of the excitation spec- 
ifies a mapping of the input pattem onto the two-dimensional topographic map. 
The map represents the distance relations within the original high-dimensional 
input Space of input pattems compressed to the two-dimensional image space 
represented by a neural array. This remarkable property follows from the as- 
sumption of lateral interactions within the array and from the assumed weight 
adaptation law (7.28). 

Let us restate that the resulting maps are nonlinear projections of the input 
space onto the feature array. The projection preserves the distance relationships 
between the pattems in the original multidimensional pattem space. Such implicit 
distances become explicit planar distances in low-dimensional image space called 
here the feature space. However, a mapping from a high-dimensional space to a 
low-dimensional one preserves only the most important neighborhood relation- 
ships within the input pattem. The less relevant properties are ignored. Moreover, 
the inputs that appear more frequently are mapped to larger and stronger domains 
at the expense of less frequent inputs. Data that are close, or similar, in the input 
space and appear frequently at the input are mapped into a cluster of images 
forming localized images on the map. Furthermore, the clusters of images are 
also arranged into topological neighborhoods that display the overall relationships 
within the entire input data set. 

Extraction of features from geometrically or physically related input pattem 
data is usually a concrete task, as demonstrated in example of Figure 7.19. This 
and other self-organizing feature map examples of Chapter 7 have made use of 
the numerical values of data in the input space. Much more abstract data need to 
be processed, for example, in the case of cognitive and linguistic relations that 
may exist in the input space. Such processing reflects relationships that neural 
networks could extract for linguistic representations and relations. Self-organizing 
maps that display semantic relations between abstract language data are called 
semantic maps (Ritter and Kohonen 1989). 

The most general concepts or abstractions needed to interpret the empirical 
world are called categories. They form an elementary lexicon embracing the 
whole domain of knowledge. They are commonly used for thinking and com- 
munication. The most common categories are (1) items (objects), (2) qualities 
(properties), (3) States (or State changes), and (4) relations (spatial, temporal, or 
other). The categories are formed on the basis of consciousness, and their typical 
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counterparts in most languages are (1) nouns, (2) adjectives, (3) verbs, and (4) 
adverbs, prepositions, etc., but also order of words (syntax). Each category usu- 
ally includes many subcategories such as persons, animals, and inanimate objects 
that are usually found within each category of items. 

Difficulty results when trying to express the human language as neural net- 
work input data as opposed to nonsymbolic input data encoding. These data can 
be represented as continuous or discrete values arranged into a vectorial variable. 
The difficulty is due to the faet that for symbolic objects, such as words, no ad- 
equate metric has been developed. In addition, except for a few special words 
denoting sounds, the meaning of a word is usually disassociated from its encoding 
into letters or sounds. Moreover, no metric relations exist whatsoever between 
the words representing similar objects. To overcome the difficulties mentioned, 
we will therefore try to present the coded value for each symbol word expressed 
in appropriate context containing values of attributes. 

One simple model for contextual representation of a symbol and its attribute 
enabling the topographical mapping on the semantic map is to concatenate the 
data vector as follows: 



where the symbol and the attribute part of the data vector are expressed as 
column vectors [x s 0]* and [0 x a ]\ respectively. 

We assume here that the symbol and attribute vectors are orthogonal. To 
reflect the metric relationship of the attribute sets, the norm, or the vector length, 
of the attribute vector is chosen as dominant over the norm of the symbol vec- 
tor during the self-organizing process. Since both symbol and attribute part are 
included in each training input, however, topological neighborhoods are created 
by symbols and their attributes. The symbols can become encoded into a planar 
array so as to reflect their logic similarities. We would expect that the recall of 
only the symbol part with the attribute part of the data vector missing would 
provide us with the calibrated topological map of symbols. Such a map would 
show implicit logical similarities that potentially exist within the set of symbols 
in the form of geometrically explicit distances between symbols. 

Assuming the local representation of the symbols, such that the fth symbol 
in the set of P symbols is assigned a ^-dimensional symbol vector x si whose 
fth component is equal to a fixed value of c, and all remaining components are 
zero. We thus have the symbol vectors of value 


Xji = [c 0 .. 

• 0]' 


O 

II 

** 

• o]' 

(8.34) 

o 

o 

II 

- 



The attributes are present or absent for each of the P symbols. The attribute 
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Figure 8.49 Tabie of attributes for the set of P = 16 animal objects. [© Springer-Verlag ; 
reprinted from Ritter and Kohonen (1989) with permission.] 


column vector x a contains the number of entries equal to the number of relevant 
attributes aliocated to the set of symbols under consideration. The presence, or 
absence, of a particular attribute is indicated by a binary entry 1 or 0, respectively. 
Note also that the unnormalized similarity measure of two symbols is the scalar 
product of their respective attribute vectors. 

Figure 8.49 illustrates a specific example input data case used for simulation 
by Ritter and Kohonen (1989). For P = 16 elements in the set of symbols being 
animals, the attribute vectors are specified by the attribute column vectors of the 
table shown. An example data vector for a cow is given as 

x=[x, 16 0 0 1 0 1 1 1 0 ... 0]' (8.35) 

where x 5l6 - [0 ... c ]* and x consists of n = 29 entries. 

Let us note that the norm of the difference vector between each pair of 
symbol vectors is identical and equal to y/2c. The numerical value of c is the 
norm of each symbol vector and it determines the relative weight of the symbol 
vector as compared to its attribute counterpart vector within the 29-entry data 
field. Input data vectors shown in the figure and used for training have been 
normalized to the unity length. To draw a comparison, this may correspond to 
the biological intensity normalization of the incoming stimuli. 

Sixteen data vectors containing a symbol (object) part concatenated with its 
attribute part as display ed in Figure 8.49 have been used to produce a semantic 
map of the set of animals. The data vectors have been used to train a 10 X 
10 planar array of neurons. The initial weights have been selected of random 
values so that no initial ordering was imposed. After about 2000 training input 
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presentations, the neighborhood sensitivity has been ascertained. The neurons’ 
responses became consistently stronger or weaker in certain regions. This has 
been leamed due to either of the excitations x = [x si 0]*, or x = [0 x ai ] 1 . 
The neurons with the strongest response due to the symbol vector excitation 
only (attribute part missing) are shown in Figure 8.50(a). They are labeled using 
the symbol that elicits the particular response. The neurons represented by dots 
indicate their nondominant responses on the map of symbols. 

The spatial order of the responses has emerged on the semantic map as a 
result of the training. The order divides the symbols into three distinet groups of 
birds (left), hunters (right), and grass-eating animals (top). Within each group, 
certain similarities are also recognizable. “Dog” is located predominantly be- 
tween “fox” and “wolf.” It is somewhat less related to “cat” and “lion” and even 
less similar to hunting birds, however, it does not seem to be related at all to 
nonhunting birds such as a “duck” or “goose.” This is also visible from Figure 
8.50(b) showing the strongest response domains. Each neuron of this map is 
marked with the stimulus eliciting its strongest response. 

The semantic maps of objects displaying logical distances between the an- 
imal symbols have been produced in this experiment using the static context 
of the relevant attribute vector. In language use, however, objects and their at- 
tributes often occur in temporal sequences. For a more comprehensive evaluation 
of the potential of self-organizing semantic maps, the concept of context needs 
to be broadened and to include the time domain. In the discussion to follow, the 
time variable present in semantic data will be represented as an order of word 
occurrence in their serial sequence. 

Let us look at a mapping of a set of 30 test words listed as shown in 
Figure 8.51(a). The test set consists of nouns, verbs, and adverbs. The set is 
subdivided into 14 pairs or triples of related words. Let us attempt to build 
sensible sentence pattems on this set. Three-word sentences created from this 
semantic map-building lexicon and allowed in this experiment have to obey 
some rudimentary rules of grammar and semantic correctness. 

Thirty-nine basic sentence pattems as shown in Figure 8.5 l(b) have been 
produced based on such rules. Test sentences are then created by plugging each 
member of the pair, or the triple, into the sentence pattems. In this mode, the basic 
sentence pattem 1-5-12, or “Bob runs fast” generates test sentences “Jim runs 
fast,” “Mary runs fast,” but also “Bob walks fast,” etc. This way the total of 498 
grammatically and semantically correct test sentences as in Figure 8.51 (c) have 
been produced. The temporal context of a word in the experiment is made by the 
immediate predecessor and successor word. To build pairs for the first and the 
last word in the sentence, the randomly generated sentences were concatenated 
in the order in which they were produced. 

Each word from the 30-word dictionary involved in the experiment can be 
defined by a vector with 30 entries with only one of them having a nonzero 
value. This would lead, however, to a rather sparse description of the input data. 
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Figure 8.50 Semantic self-organizing symbol map obtained through training using attributes: 
(a) strongest responses due to the symbol part only excitation and (b) strongest response 
domains. [©Springer-Verlag; reprinted from Ritter and Kohonen (1989) with permission.] 
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Mary likes meat 
Jim speaks well 
Mary likes Jim 
Jim eats often 
Mary buys meat 
dog drinks fast 
horse hates meat 
Jim eats seldom 
Bob buys meat 
cat walks slowly 
Jim eats bread 
cat hates Jim 
Bob selis beer 
(etc.) 


(b) 


(c) 


Figure 8.51 Words and sentences used for context production: (a) nouns, verbs, and adverbs, 
(b) sentence patterns, and (c) selected examples of sentences. [©Springer-Veriag; reprinted from 
Ritter and Kohonen (1989) with permission.] 


For more economical encoding, a 7-tuple vector of unit length has been used 
(Ritter and Kohonen 1989). Such 7-tuple vectors have been generated at random 
for each word in the experiment, thus ensuring quasi-orthogonality of different 
word vectors. It follows then that the word in the context can be expressed using 
a single vector of 21 entries, where 14 entries represent the predecessor and suc- 
cessor context words. To perform training more efficiently, the average context 
of each word under consideration has been used instead of step-by-step leaming 
with each sentence applied separately. Ten thousand sentences having all vectors 
have been used for averaging. The average word contexts were represented as 
14 entries with word vectors scaled to unit length. They have been given a role 
similar to the attribute vector x a described in the preceding simulation. 

Each of the 7-tuple symbol-word vectors was scaled to the length c = 0.2 
and combined with the 14-tuple average context vector. As a result, 21 scalar 
input data have been supplied as inputs to the array. A planar array of 10 X 15 
neurons with 21 input nodes and weights randomly initialized prior to the train- 
ing has been used for this semantic map design. Each of the 21 inputs has 
been connected to each of 150 neurons. After 2000 training presentations, the 
responses to symbol words alone have been tested. As can be seen from Figure 
8.52(a), the training contexts have positioned the words in such an order that their 
mutual arrangements reflect existing semantic and grammatical relationships. A 
separate grouping for nouns, verbs, and adverbs has resulted, as marked by the 
partitioning curves. Each of the three domains clustering parts of speech has 
discemible subregions. For instance, names of persons or adverbs of opposite 
meaning tend to cluster together. 
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Figure 8.52 Semantic maps of words obtained by using their mutual context for training: 
(a) preceding-succeeding word context and (b) preceding word context. [©Springer-Verlag; 
reprinted from Ritter and Kohonen (1989) with permission.] 


In a related semantic map generation experiment, the word context was 
limited to the immediately preceding word only. Even with that limited context, 
a map showing similar distance relationships has resulted as shown in Figure 
8.52(b). It can be seen that even with the restricted context, a meaningful se- 
mantic map can be produced. 

These feature extraction experiments have shown the extension of self- 
organizing neural arrays to processing of expressions that are of symbolic nature 
and display rather subtle and even evasive relationships. The simulations of 
semantic structures have resulted in semantic neighborhood maps that encode 
words into relative planar regions of localized responses. As seen, generated 
semantic maps allow the segregation and grouping of symbolic data based on 
an unsupervised leaming approach. The topographic organization of concepts 
as discussed above may be offering us some clues as to why the information 
processing in the brain is spatially organized and distributed. 


8.7 

CONCLUDING REMARKS 

The presentation focus of this chapter has been on neurocomputing engineer- 
ing projects. Our concem has been analyzing and designing sample information 
processing systems that are nonprogrammable. We followed a fundamentally 
new and different information processing paradigm that is based on developing 
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associations, mapping, or transformations between data in response to the en- 
vironment. To achieve these objectives, we essentially pursued nonalgorithmic 
paths for solving specific engineering problems. 

Well above a dozen representative engineering application problems have 
been explored in this chapter. Application examples have been selected so that 
they reinforce and illustrate basic theoretical concepts discussed in preceding 
chapters. Projects that seem exciting but are not easily traceable and reproducible 
have been excluded from coverage. Networks for constrained optimization, for 
printed and handwritten character recognition, for controlling static and dynam- 
ical plants and actuating robot arms, connectionist expert systems for medical 
diagnosis, and finally networks for processing semantic information have been 
studied. It is believed that the depth and diversity of the coverage in this chapter 
has prepared the reader at least for pursuing new application-specific projects, if 
not for independent research. 

Due to the abundance and diversity of technical publications on artificial 
neural systems applications, many other interesting problem solutions could not 
be addressed in this book. A number of edited volumes on artificial neural 
systems (Denker 1986; Anderson and Rosenfeld 1988; Anderson 1988; Touret- 
zky 1989, 1990; Omidvar 1992) and special issues of technical journals (IEEE 
Computer ; March 1988; IEEE Transactions on Circuits and Systems, June 1990; 
Proceedings of the IEEE, September 1990 and October 1990) include discussion 
of different applications. In addition, the reader may refer to specific applications 
from the following partial list of projects: 

■ Trainable networks for solving matrix algebra problems (Wang and 
Mendel 1990) 

■ Reinforcement and temporal-difference leaming controller for an 
inverted pendulum problem (Anderson 1989) 

■ Truck “backer-upper” neurocontroller and control of complex dynamical 
plant (Nguyen and Widrow 1990) 

■ Trajectory generation based on visual perception of mobile robots 
(Kawato et al. 1988) 

■ Sensor-based control of robots with vision (Miller 1989; Hashimoto et 
al, 1990) 

■ Networks for robot control (Liu, Iberall, and Bekey 1989; Wilhelmsen 
and Cotter 1990) 

■ Autonomous land vehicle neurocontroller /driver (Pomerleau 1989) 

■ Neural system NETTALK for continuous speech recognition (Sejnowski 
and Rosenberg 1987) 

■ General review of neural network performance for speech analysis 
(Lippman 1989) 

■ Temporal signals recognition networks (Gorman and Sejnowski 1988; 
Eberhart and Dobbins 1990) 
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k Inspection of machined surfaces (Villalobos and Gruber 1990) 

■ Connectionist expert system for bond rating (Dutta and Sekhar 1988). 

Typical applications of neuroeomputing indicate the ability of neural net- 
works to handle multiple input and output variables, nonlinear relationships, 
past States, and implement fast network leaming when suitable architectures are 
used. It is important to realize that these are not special, magical, or stand- 
alone techniques and applications. Rather, they can be looked at as outgrowth 
of long periods of research in function approximation, optimization, signal pro- 
cessing, pattem classification and decision theory, and stochastic approximation 
(Anderson 1989; Poggio and Girosi 1990; Tsypkin 1973). What makes neural 
network approaches different from conventional engineering is the methodology, 
specifically, the experimental methodology of leaming and problem formulation 
that underlies neural network methods. 

It seems somewhat easy to perceive the free-wheeling nature of many ap- 
proaches we have taken during our discussion and the absence of theoretical 
guarantees, derivations, and proofs. Such approaches have often been present in 
earlier theoretical presentations and perhaps even more in the practical applica- 
tions of artificial neural systems addressed in this chapter. However, we have 
made no restrictive assumptions or simplifications such as linearization, conver- 
gence, and stability. Interestingly, they have not been needed to accomplish a 
number of successful artificial neural system designs (Barto 1990). 

As time passes, new neuroeomputing engineering methods and applications 
will undoubtedly emerge. Despite spectacular successes, one should not be mis- 
led, however, about the still existing limitations of neural network technology. 
Currently neural networks can already be of great value in helping to solve 
real problems without exaetly mimicking many funetions performed by biolog- 
ical neural networks. Animals and humans can attain complex combinations 
of information processing, pattem recognition and feedback, and feedforward 
transmissions and of cognitive acts. While these very complex facets of neuro- 
computing may be considered as eventual aspirations for artificial neural systems, 
they should not be used as criterion for their present value. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P8J The cost function for a linear programming problem is c - v { + 2v 2 . The 
constraints are known as 

- v 2 < 2 
2v l + 3v 2 ^ 12 
v 1 ^ 0 
v 2 — 0 
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Figure P8.2 Linear programming problem of Problem P8.2. 


(a) Find the solution of the problem by analyzing the conditions on the 
solution plane. 

(b) Find the weights and currents of the gradient-type neural network 
similar to that shown in Figure 8.1 that would solve this task. 

P8.2 For the linear programming problem with the cost function c = 2v 1 + v 2 
and the constraints as shown in Figure P8.2, find 

(a) the solution of the linear programming problem. 

(b) the weights and currents of the gradient-type neural network similar 
to that shown in Figure 8.1 that would solve this task. 

P8.3* The static plant shown in Figure P8.3 needs to convert Cartesian coordi- 
nates x, y 9 z, into cylindrical coordinates p, cf>, z. The relations between 
input and output variables are also shown in the figure. Design a mul- 
tilayer feedforward network that performs the forward identification and 
models the plant in the forward mode as shown in Figure 8.17(a). The 
training test points should be selected for the plant outputs in the range 
0 < p < 1,0< 4> < tt/ 2, and 0 < z < 1. Select suitable architecture for 
the multilayer feedforward network and train the network. 

P8A* Repeat Problem P8.3 for the plant inverse identification by modeling the 
plant, which converts the Cartesian x, y, z coordinates into cylindrical 
coordinates p, <£, z in the same working region. [Hint: Use Figure 8.17(b) 
and model the inverse of the plant by training a multilayer feedforward 
network of suitable architecture. Study the mapping accuracy produced by 
the network by computing the differences between the network response 
and the accurate response value.] 
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Cartesian / Cylindrical Coordinates 


x,y,z I p, <(), z 
jc=pcos<|>, y=psin<|>, z—z 

p=V?+7, sm^=-j== 
(b) 


Figure P8.3 Conversion of Cartesian to cylindrical coordinates: (a) plant and (b) notation and 
relationships. 


P8.5* The static plant shown in Figure P8.5 converts Cartesian coordinates x, 
y, z, into spherical coordinates r, $, 0. The relations between the input 
and output variables are also shown in the figure. Design a multilayer 
feedforward network that performs the forward plant identification and 
models it in the forward mode as shown in Figure 8.17(a). The plant out- 
puts, or training test points, should be selected in the range: 0 < r < 1, 
0 < <f> < 7t/ 2, and 0 < 0 < 7r/2. 

P8.6* Repeat Problem P8.5 for plant inverse identification for the plant convert- 
ing the Cartesian x , y, z coordinates into spherical ones, r, <£, 0, in the 
same working region. [Hint: Use Figure 8.17(b) and model the inverse of 
the plant by training a multilayer feedforward network of suitable archi- 
tecture.] 

P8.7 Network A needs to be trained to perform as plant inverse with its copy 
serving as neurocontroller B. The leaming control system is shown in 
Figure P8.7(a). The expressions describing the plant are, in general, not 
known for this control arrangement. In this problem, however, it is as- 
sumed that the plant is known to be a Cartesian-to-cylindrical coordinates 
converter as illustrated in Figure P8.3. For the input d specified in Figure 
P8.7(b), the measurements of x and o have been obtained as indicated on 
the figure. Compute 

(a) plant output vector y 
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(a) 



Cartesian / Cylindrical Coordinates 
x, y, z / r, <(>, 0 

x=rsinØcos<|>, y=rsinØsinc|>, z=rcosØ 

sinø=y^^, sin4.=^= 

(b) 


Figure P8.5 Conversion of Cartesian to spherical coordinates: (a) plant and (b) notation and 
relationships. 


(b) control error ||d — y|| at the plant output 

(c) error ||x — o|| which is used for training of network A. 

Now assume that following the training, both networks A and B are ideally 
in tune with the plant inverse. Compute x, y, and |[x— o|| for this condition 
if the neurocontroller input has changed to 


d = 


0.275 

0.725 

0.300 


P8.8 The plant shown in Figure P8.8 is controlled in a specialized on-line 
leaming control architecture. In general, the plant equations y, = y i (o l ,o 2 ), 
i = 1, 2, are not known; however, for the purpose of this exercise it is 
assumed that 


1 2 

yi = 2°i + SU1 °2 

y 2 = °1+ O l ^/o 2 


(a) Compute the Jacobian matrix from the analytical description of the 
plant. 

( b ) Knowing that the operating point of the plant is Oj* = 0.472 and 
o 2 * = 0.125, compute the four entries of the Jacobian matrix J at 
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Neurocontroller 



(a) 


’ 0.5590" 


’0.5520’ 


'0.5121 

0.4636 

, x = 

0.1919 

, 0 = 

0.2275 

0.3333 


_ 0.2716 _ 


_ 0.2870 _ 


(b) 


Figure P8.7 Feedforward control with plant inverse learning: (a) block diagram and (b) results 
of measurements before completion of training. 


this point o*: 


tyi 

tyi 

do 1 

do 2 

tyi 

tyi 

doi 

do 2 


P8.9 The plant of Problem P8.8 is working in the same learning control archi- 
tecture as shown in Figure P8.8. Compute numerically the approximations 
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for the Jacobian matrix entries at the quiescent operating point o* given as 


'øj*' 


'0.472' 

ø 2 *. 


0.125 


using the perturbation method. The disturbed values are to be assumed as 
deviated by +5 percent from the operating point, thus making the plant 
signal equal to 1.05 o*. 

P8.10 The plant described with the following nonlinear equations 




Oi <> 2 + 


1 


y 2 = sfo[ + øj sin o 2 


is used in the Control arrangement shown in Figure P8.8. Its operating 
point is 


o* = 


0.721 

0.293 


and the desired response for this input is known to be 


d = 


1.725 

1.293 


Compute numerically 

(a) the plant output y* 

(b) the output error E = (l/2)||d - y*|| 

(c) the relative error values transferred to the plant input, i.e., -dE/do x 
and —dE/do 2 , at this operating point. 

P8.ll You need to use common-sense judgment and provide advice to balance 
the cart-pole system shown in Figures 8.24 and 8.25. Three sample pairs 
of quantized images are displayed in Figure P8.ll. By analyzing each se- 
quence of two images, determine what sign of the horizontal force needs 
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Earlier image Present image 

(t-OAs) (0 





Figure P8.11 Three pairs of quantized images of the cart-pole system. 


to be applied at present to improve the system’s balance. Each of the 
three cases shown in rows (1), (2), and (3) of the figure should be treated 
separately. (Note: The force toward the right is positive; the force toward 
the left is negative.) 

(a) Use your own judgment; assume that you are the teacher and you are 
trying to train the controller. 

(b) Use the weights of the trained controller to obtain a crude approx- 
imation of the discrete perceptron activation value. Inspect Figure 
8.25(b) for this purpose and draw conclusions as to the sign of the 
force needed for balancing. 

P8.12 The discrete -time plant is described with the following difference equation: 

x(k) = 2 y{k + 1) — y(k) — 0.5 

where x{k) and y(k) denote input and output, respectively, at time instant 
numbered k. The desired output wave of the plant is specified as 

y(k) = sin + 1 

Assuming that a characteristic surface *(£) = x [y(k),y(k +1)] needs to 
be leamed by the CMAC neurocontroller, identify the initial sequence of 
plant control signal values x(0), x(l), . . . , x(12), that would be located on 
the characteristic surface. 
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P8.13 A multilayer feedforward network needs to be designed that solves the 
inverse kinematics problem as shown in Figure 8.33(b). The network’s 
eventual use will be in the control arrangement from Figure 8.35(b). 

The network should model the two-link planar manipulator as in Fig- 
ure P8.13(a), with its end effector on a desired trajectory being a straight 
segment AB as shown in Figure P8.13(b). 

(a) Draw the network architecture that would be expected to solve this 
problem. 

(b) Prepare a training set of 11 pairs of input /output data. Note that 
the desired output data are 0 l9 0 2 , in response to the end effector 
Cartesian coordinates x,y provided as inputs at each point. Assume 
/j = l 2 = 1 for this manipulator. 

P8.14 Repeat Problem P8.13 for the elliptic end effector trajectory having the 
equation 



Select ten training points on the trajectory in the first quadrant. The points 
should be rr! 18 apart as indicated in Figure P8.14. Prepare complete 
training data for l x = l 2 - 1. 

P8.15 A multilayer feedforward network needs to be trained to solve the forward 
kinematics problem. The desired trajectory of a two-link planar manipu- 
lator, in joint space, is given as 



556 


APPLICATIONS OF NEURAL ALGORITHMS AND SYSTEMS 



Figure P8.14 Trajectory to be learned by the two-link planar manipulator. 


(a) Produce 21 training pairs for trajectory points taken at t = 0, 1, 2, 
. . . , 20; assume l x — 2 and l 2 = 3. 

(b) Devise a network architecture that would be expected to train properly 
for this problem. 

P8.16* The multilayer feedforward network shown in Figure P8.16(a) needs to 
be designed to solve the forward kinematics problem. The network work- 
ing area is the circular trajectory with r = 3 shown in Figure 8.36(a). 
Implement the training of the network using the training data from Figure 
P8.16(b) [normalization of data may be needed to fit within the (—1,1) 
range]. 

P8.17* Repeat Problem P8.16 using a multilayer feedforward network of the ar- 
chitecture 2-4-4-2 as shown in Figure P8.17 to model the solution of the 
inverse kinematics problem. Use the training data set of Figure P8.16(b). 

P8.18 In this problem you are asked to design a simple “mini expert system” 
called free time advisor (FTA) that would suit your personal needs. The 
FTA is supposed to advise you which of the free time activities to choose 
from given the circumstances preceding your decision making. The five 
choices for the activities are 

(a) Socializing (date or visiting with friends, etc.) 

(b) Physical exercise (tennis, jogging, swimming, etc.) 

(c) Shopping or repairing things (taking care of necessities) 

(d) Cultural activities (seeing a good movie or reading a book, etc.) 

(e) Doing nothing (no suggestions needed here). 

Factors determining your final choice of activity are virtually countless. 
Limiting their number to a dozen or so (but no more than two dozen) 
allows us to keep the expert system development work traceable. Prepare 
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(a) 


Point 

Angle 

(deg) 

End Eff. (Cartesian) 

Joint Angles 

(rad) 

X 

y 

»i 

ø 2 

1 

5 

2.9886 

0.2615 

0.7669 

-1.9106 

2 

15 

2.8978 

0.7765 

0.9414 

— 1.9106 

3 

25 

2.7189 

1.2679 

1.1160 

-1.9106 

4 

35 

2.4575 

1.7207 

1.2905 

-1.9106 

5 

65 

2.1213 

2.1213 

1.4608 

-1.9106 

6 

55 

1.7207 

2.4575 

1.6396 

-1.9106 

7 

65 

1.2679 

2.7189 

1.8141 

-1.9106 

8 

75 

0.7765 

2.8978 

1.9886 

-1.9106 

9 

85 

0.2615 

2.9886 

2.1632 

-1.9106 

10 

90 

0 

3.0000 

2.2504 

-1.9106 


(b) 


Figure P8.16 Figure for forward kinematics Problem P8.16: (a) network architecture and (b) 
training data. 


your list of factors. These factors typically can be 

( 1 ) Money available 

( 2 ) Weather predicted 

( 3 ) Amount of time available 

( 4 ) A good book at hånd 

( 5 ) Recent history of free time activities 

( 6 ) Empty refrigerator and pantry 

( 7 ) Car makes unusual noises and needs repair 

( 8 ) More sleep or rest needed. 

For final lists of activities and factors that influence your decision, propose 
a set of test data consisting of the total of 10 to 20 pairs of input /output 
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vectors, with at least 2 pairs for each specific activity from the list. Code 
inputs as mixed variables (continuous and binary) as you see appropriate 
in the range [-1,1] and assign 0 to those inputs that do not matter. 

P8.19* Develop a multilayer feedforward character classifier for five printed digits 
shown as 5 X 5 black-white pixel maps on Figure P8.19. 

(a) Devise a suitable network architecture for a local representation clas- 
sifier. 

(b) Prepare the set of five input/output binary training vector pairs. 

(c) Train the network for zero decision errors. 

(d) Perform the recall of nondistorted digits by reusing the training input 
data. 

(e) Perform the evaluation of the classifier by recalling digits distorted 
by the center pixel (pixel 13) of the 5X5 field being white rather 
than black. 

(f) Evaluate the classifier by recalling digits distorted by reversal of input 
pixels 12, 13, and 14. 



Figure P8.19 Pixel maps for digit recognition network in Problem P 8 . 19 . 






REFERENCES 


559 


REFERENCES 

Anderson, C. W. 1989. “Leaming to Control an Inverted Pendulum Using Neural 
Networks,” IEEE Control Systems Magazine , (April): 31-37. 

Anderson, D. Ed. 1988. Neural Information Processing Systems. New York: 
American Institute of Physics. 

Anderson, J. A., and E. Rosenfield. Eds. 1988. Neurocomputing: Foundations of 
Research. Cambridge, Mass.: The MIT Press. 

Apple Computer. 1986. Imagewriter User' s Manual , Part 1: Reference. Cupertino, 
Calif.: Apple Computer. 

Arteaga-Bravo, F. J. 1990. “Multi-Layer Back-Propagation NetWork for Leaming 
the Forward and Inverse Kinematics Equations,” in Proc. Joint 1990 IEEE 
Int. Neural NetWork Conf , pp. 11-319-321. 

Barto, A. G., R. S. Sutton, C. W. Anderson. 1983 “Neuronlike Adaptive Elements 
That Can Solve Difficult Leaming Control Problems,” IEEE Trans. Systems , 
Man , and Cybernetics SMC-13(5): pp. 835-846. 

Barto, A. G. 1990 “Connectionist Leaming for Control: An OverView,” in 
Neural Networks for Control , Eds. W. T. Miller, III, R. S. Sutton, P. J. 
Werbos. Cambridge, Mass.: The MIT Press. 

Baxt, W. G. 1990. “Use of an Artificial Neural NetWork for Data Analysis in 
Clinical Decision-Making: The Diagnosis of Coronary Occlusion,” Neural 
Computation , 2: 480-489. 

Bounds, D. G., P. J. Lloyd, B. Mathew, and G. Wadell. 1988. “A Multilayer 
Perceptron NetWork for the Diagnosis of Low Back Pain,” in Proc. IEEE 
Int. Conf on Neural Networks , San Diego, Califomia. pp. 11-481-489. 

Chandra, V., and R. Sudhakar. 1988. “Recent Developments in Artificial Neural 
NetWork Based Character Recognition: A Performance Study,” Proc. IEEE 
Southeastcon f Knoxville, Tennessee. pp. 633-637. 

Craig, J. J. 1986. Introduction to Robotics Mechanics and Control. Reading, 
Mass.: Addison Wesley Publishing Co. 

Dayhoff, R. E. and J. E. Dayhoff. 1988. “Neural Networks for Medical Image 
Processing” in Proc. IEEE Symp. on Computer Applications in Medical 
Care , Washington, D.C. pp. 271-275. 

Denker, J. S. Ed. 1986. Neural Networks for Computing. AIP Conference 
Proceedings. New York: American Institute of Physics. 

Dutta, S., and S. Shekhar. 1988 “Bond Rating: A Non-conservative Application 
of Neural Networks,” in Proc. IEEE Int. Conf. on Neural Networks , San 
Diego, Califomia. 


560 


APPLICATIONS OF NEURAL ALGORITHMS AND SYSTEMS 


Eberhart, R. C. and R. W. Dobbins. 1990. Neural NetWork PC Tools y A Practical 
Guide. San Diego, Calif.: Academic Press. 

Feldman, R. S. 1987 Under standing Psychology. New York: McGraw-Hill Book 
Co. 

Fu, K. S. 1986. “Leaming Control Systems — Review and Outlook,” IEEE Trans. 
Pattern Analysis and Machine Intelligence 8(3): 327-342. 

Gallant, S. I. 1988. “Connectionist Expert Systems,” Comm. ACM 31(2): 152- 
169. 

Gorman, R., and T. Sejnowski. 1988. “Leamed Classification of Sonar Targets 
Using a Massively Parallel NetWork,” IEEE Trans. Acoustics , Speech , and 
Signal Proc. 36: 1135-1140. 

Graf, H. P., L. D. Jackel, and W. E. Hubbard. 1988. “VLSI Implementation of 
a Neural NetWork Model,” IEEE Computer (March): 41-49. 

Guez, A., J. L. Eilbert, and M. Kam. 1988. “Neural NetWork Architecture for 
Control,” IEEE Control Systems Magazine (April): 22-25. 

Hashimoto, H., T. Kubota, M. Kudon, and F. Harashima. 1990. “Visual Control 
of a Robotic Manipulator Using Neural Networks,” in Proc. 29th Conf. on 
Decision and Control , Honolulu, Hawaii, pp. 3295-3302. 

Hecht-Nielsen, R. 1990. Neurocomputing. Reading, Mass.: Addison-Wesley 
Publishing Co. 

Hertz, J., A. Krogh, and R. G. Palmer. 1991 Introduction to the Theory of Neural 
Computation. Redwood City, Calif.: Addison-Wesley Publishing Co. 

Howard, R. E., B. Boser, J. S. Denker, H. P. Graf, D. Henderson, W. Hubbard, L. 
D. Jackel, Y. LeCun, and H. S. Baird. 1990 “Optical Character Recognition: 
A Technology Driver for Neural Networks,” Proc. IEEE Int. Joint Neural 
Network Conf. pp. 2433-2436. 

Jackel, L. D., H. P. Graf, W. Hubbard, J. S. Denker, and D. Henderson. 1988. 
“An Application of Neural Net Chips: Handwritten Digit Recognition,” 
in Proc. IEEE Int. Conf on Neural Networks, San Diego, Califomia. pp. 
11-107-115. 

Jones, D. 1991. “Neural Networks for Medical Diagnosis,” in Handbook of 
Neural Computing Applications. New York: Academic Press. 

Josin, G., D. Chamey, and D. White. 1988. “Robot Control Using Neural Net- 
works,” Proc. IEEE Int. Conf. on Neural Networks , San Diego, Califomia. 
pp. 11-615-631. 

Kawato, M., Y. Uno, M. Isobe, and R. Suzuki. 1988. “Hierarchical Neural 
Network Model for Voluntary Movement with Applications to Robotics, 
IEEE Control Systems Magazine (April): 8-16. 


REFERENCES 


561 


Kohonen, T. 1990. “The Self-Organizing Map,” Proc. ofthelEEE 78(9): 1464- 
1480. 

Kraft, L. G., and D. S. Campagna. 1990a. “A Comparison Between CMAC 
Neural NetWork Control and Two Traditional Adaptive Control Systems,” 
IEEE Control Systems Magazine (April): 36-43. 

Kraft, L. G., and D. S. Campagna. 1990b. “A Summary Comparison of CMAC 
Neural NetWork and Traditional Adaptive Control Systems,” in Neural 
Networks for Control, ed. T. W. Miller III, R. S. Sutton, and R J. Werbos. 
Cambridge, Mass.: The MIT Press. 

LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, 
and L. D. Jackel. 1989. “Backpropagation Applied to Handwritten Zip 
Code Recognition,” Neural Computation 1: 541-551. 

Liu, H., T. Iberall, and G. A. Bekey. 1989. “Neural NetWork Architecture for 
Robot Hånd Control,” IEEE Control Systems Magazine (April): 38-43. 

Lippman, R. P. 1989. “Review of Neural Networks for Speech Recognition,” 
Neural Computation 1(1): 1-38. 

Miller, W. T., III. 1989. “Real-Time Application of Neural Networks for Sensor- 
Based Control of Robots with Vision,” IEEE Trans. Systems, Man, and 
Cybernetics 19(4): 825-831. 

Mishkin, E., and L. Braun. Eds. 1961. Adaptive Control Systems. New York: 
McGraw-Hill Book Co. 

Narendra, K. S., and K. Parthasarthy. 1990. “Identification and Control of 
Dynamical Systems using Neural Networks,” IEEE Trans. Neural Networks 
1(1): 4-21. 

Narendra, K. S., and K. Parthasarathy. 1991. “Gradient Methods for the 
Optimization of Dynamical Systems Containing Neural Networks,” IEEE 
Trans. Neural Networks 2(2): 252-262. 

Nguyen, D. H., and B. Widrow. 1990. “Neural Networks for Self-leaming 
Control Systems,” IEEE Control Systems Magazine (April): 18-23. 

Nguyen, L., R. V. Patel, and K. Khorasani. 1990. “Neural NetWork Architectures 
for the Forward Kinematics Problem in Robotics,” in Proc. Joint IEEE lnt. 
Neural Networks Conf, San Diego, Califomia. III-393-399. 

Norris, D. E. 1986. “Machine Leaming Using Fuzzy Logic with Applications in 
Medicine,” Ph.D. diss., University of Bristol, United Kingdom. 

Omidvar, O. M. Ed. 1992. Progress in Neural Networks. Norwood, N.J.: Ablex 
Publishing Co. 

Poggio, T., and F. Girosi. 1990. “Networks for Approximation and Leaming,” 
Proc. IEEE 78(9): 1481-1497. 


562 


APPLICATIONS OF NEURAL ALGORITHMS AND SYSTEMS 


Pomerleau, D. A. 1989. “ALVINN: An Autonomous Land Vehicle in a Neural 
NetWork,” in Advances in Neural Information Processing Systems , vol. 1, 
D. Touretzky. San Mateo, Calif.: Morgan Kaufmann Publishers. 

Psaltis, D., A. Sideris, and A. Yamamura. 1987. “Neural Controllers,” in Proc. 
IEEE Int. Neural Networks Conf. , San Diego, Califomia. pp. IV-551-558. 

Psaltis, D., A. Sideris, and A. Yamamura. 1988. “A Multi-layer Neural NetWork 
Controller,” IEEE Control Systems Magazine (April): 17-21. 

Ritter, H., and T. Kohonen. 1989. “Self-Organizing Semantic Maps,” Biolog . 
Cybern. 61: 241-254. 

Saerens, M., and A. Soquet. 1991. “Neural Controller Based on Back-Propagation 
Algorithm,” IEE Proc., Part F 138(1): 55-62. 

Sejnowski, T., and C. Rosenberg. 1987. “Parallel Networks that Leam to 
Pronounce English Text,” Complex Systems 1: 145-168. 

Simmons, D. M. 1972. Linear Programming for Operations Research . San 
Fransisco: Holden-Day. 

Sobajic, D. J., J. J. Lu, and Y. H. Pao. 1988. “Intelligent Control for the 
Intelledex 605 T Robot Manipulator,” in Proc. 1988 IEEE Int . Neural 
Networks Conf, San Diego, Califomia. pp. 11-633-640. 

Tank, D. W., and J. J. Hopfield. 1986. “Simple ‘Neural’ Optimization Networks: 
An A/D Converter, Signal Decision Circuit and a Linear Programming 
Circuit,” IEEE Trans. Circ. Syst. CAS-33(5), 533-541. 

Tolat, V. V., and B. Widrow. 1988. “An Adaptive ‘Broom Balancer’ with 
Visual Inputs,” in Proc. 1988 IEEE Int. Neural Networks Conf, San Diego, 
Califomia. pp. 11-641-647. 

Touretzky, D. Ed. 1989. Advances in Neural Information Processing Systems, 
vol. 1. San Mateo, Calif.: Morgan-Kaufmann Publishers. 

Touretzky, D. Ed. 1990 Advances in Neural Information Processing Systems, 
vol. 2. San Mateo, Calif.: Morgan-Kaufmann Publishers. 

Tsypkin, Ya. Z. 1973. Foundations of the Theory of Learning Systems. New 
York: Academic Press. 

Villalobos, L., and S. Gruber. 1990. “Interpolation Characteristics and Noise 
Sensitivity of Neural NetWork Based Inspection of Machined Surfaces,” in 
Proc. 1990 IEEE Int. Conf on Robotics and Automation. 

Wang, L., and J. M. Mendel. 1990. “Structured Trainable Networks for Matrix 
Algebra,” in Proc. IEEE 1990 Joint Conf. on Neural Networks, San Diego, 
Califomia. pp. 125-132. 

Widrow, B., and R. Winter. “Neural Nets for Adaptive Filtering and Adaptive 
Pattem Recognition,” IEEE Computer (March): 25-39. 


REFERENCES 


563 


Widrow, B., and M. A. Lehr. 1990. “30 Years of Adaptive Neural Networks: 
Perceptron, Madaline, and Backpropagation,” Proc. IEEE 78(9): 1415— 
1441. 

Wilhelmsen, K., and N. Cotter. 1990. “Neural NetWork Based Controllers for a 
Single-Degree-of-Freedom Robotic Arm,” in Proc. 1990 Int. Joint Conf. on 
Neural Networks , San Diego, Califomia. pp. III-407-413. 

Yoon, Y. O., R. W. Brobst, P. R. Bergstresser, and L. L. Peterson. 1989. “A 
Desktop Neural NetWork for Dermatology Diagnosis,” Journal of Neural 
NetWork Computing, (Summer): 43-52. 

Zigoris, D. M. 1989. “Performance Evaluation of the Back-Propagation Algo- 
rithm for Character Recognition,” M. Eng. thesis, University of Louisville, 
Kentucky. 

Zurada, J., D. M. Zigoris, P. B. Aronhime, and M. Desai. 1991. “Multi-Layer 
Feedforward Networks for Printed Character Classification,” Proc. 34th 
Midwest Symp. on Circuits and Systems, Monterey, Califomia, May 14-16. 





NEURAL NETWORKS 
IMPLEMENTATION 

Physical models are as different 
from the world as a geographical 
map is from the surface of the earth. 

L. BRILLOUIN 

9.1 Artificial Neural Systems: OverView of Actual Models 

9.2 Integrated Circuit Synaptic Connections 

9.3 Active Building Blocks of Neural Networks 

9.4 Analog Multipliers and Scalar Product Circuits 

9.5 Associative Memory Implementations 

9.6 Electronic Neural Processors 

9.7 Concluding Remarks 


A rtificial neural systems are currently realized in a number of ways. As we 
have seen, they can be implemented as computer simulations. This approach 
has been used for a number of examples in this text. Numerical simulations also 
support most of the computational end-of-chapter exercises involving training 
and recall. For the best results and real-life applications, however, artificial neural 
networks need to be implemented as analog, digital, or hybrid (analog /digital) 
hardware. Moreover, neural network processors rather than digital computer sim- 
ulations seem to be the key ingredient to further expansion and commercialization 
of neural network technology. Fortunately, study of the technical reports and of 
products available indicates that electronic implementations of neural networks 
are feasible and promising. This provides us with the focus and justification for 
neural network hardware implementation study in this chapter. 

While neural network hardware development is being pursued in a number 
of research centers, neurocomputing concepts and applications are usually tested 
through simulations of neural algorithms on digital computers. To this aim, a very 
wide spectrum of computers, called programmable neurocomputers , can be used. 
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A simple microcomputer running neural network training or recall software can 
therefore be termed a programmable neurocomputer. We should realize, however, 
that an especiaily efficient programmable neurocomputer can be built by tailoring 
its architecture to the flow of the neural algorithm. This would involve organiz- 
ing the computations in parallel, scheduling appropriate data communication, and 
specializing in arithmetic or transfer operations that are performed frequently. 

A number of companies have brought coprocessor accelerator boards on the 
market. They typically interface to personal computers or conventional worksta- 
tions and are able to enhance the speed of numerical computations considerably. 
Coprocessor accelerator boards also make the simulations of neural algorithms 
faster and more accurate. A natural progression of these coprocessors has pro- 
duced neural array processors with dedicated architecture and computation flow. 
A number of specialized parallel neurocomputers which are programmable have 
been developed. 

However, any programmable sequential general-purpose neurocomputer is 
an order of magnitude slower than neural hardware, which could be directly pro- 
duced using the same fabrication technology as the programmable neurocomputer. 
Our objective in this chapter is to review briefly programmable neurocomputers 
and provide extensive fundamentals on neural hardware. Such neural hardware is 
called a neurocomputer . Neurocomputers are dedicated computing devices work- 
ing with embedded neural processing algorithms and employ parallel processing 
to increase the throughput. Programming is not required for their operation, 
but they often interface with a programmable training controller. The high- 
performance parallel computing unit must be matched with an efficient interface 
to avoid bottlenecks in training and recall network functions. We will focus in this 
chapter on silicon-based neural hardware which represents the largest category 
among special-purpose analog and digital neurocomputers. 


ARTIFICIAL NEURAL 
SYSTEMS: OVERVIEW OF 
ACTUAL MODELS 

Throughout the preceding chapters we have covered neural network con- 
cepts and applications, but only marginal attention has been paid to budding 
actual working models of artificial neural systems. Our study thus far indicates, 
however, that the computation by neural networks whether implemented on a 
programmable or nonprogrammable neurocomputer is performed differently than 
computation by ordinary digital computers. In this section we attempt to address 
these differences. In addition, an overview of computational requirements for 
implementing connectionist models is the focus of this section. 


9.1 ARTIFICIAL NEURAL SYSTEMS: OVERVIEW OF ACTUAL MODELS 


567 


The diversity of neural network concepts, algorithms, and abstractions, 
which dominated the coverage in previous chapters, also extends into the area of 
building their physical models. We will see that artificial neural networks can be 
built using analog, digital, or hybrid dedicated electronic or optical hardware. As 
mentioned before, neurocomputing algorithms can be tested and implemented on 
either dedicated or general-purpose conventional computers, called programmable 
neurocomputers. In addition, mixtures of computer simulations with digital or 
analog integrated Circuit computations are often applied. These usually dedicated 
circuits can be built in a variety of technologies to assist with efficient imple- 
mentations of specialized neurocomputing algorithms. 

Let us look at an example of a neurocomputer that involves such mixed 
computations. One of the commercial neurocomputers involves simulation of 
error back-propagation training on a conventional programmable host computer. 
Subsequent transfer (downloading) and storage of weights, which result from the 
training, are performed by the digital circuitry. However, information stored in the 
network is recalled through analog computation within the designated electronic 
circuitry. In faet, the analog computation mode and parallel form of information 
processing and retrieval are responsible for the fast recall performed by the 
discussed network. In this neurocomputer we thus have numerical simulation of 
training, which is performed off-chip, digital transfer and data storage circuitry, 
and a dedicated analog neural network that performs recall only. 

Let us focus on efficient hardware implementation of neural algorithms. 
Review of processing algorithms indicates that batch leaming and ineremental 
leaming either from current errors or from past experience, as well as recall of 
stored information, are all executed as sequences of relatively simple arithmetic 
operations such as weighted product summation followed by a nonlinear mapping. 
The most commonly encountered operations are scalar and outer product vector 
multiplication and matrix-vector multiplication. These operations are performed 
as a series of multiply and /or multiply and add operations. Additions and sub- 
tractions of matrices and vectors, and ordinary multiplications often involving 
the generation of a nonlinear activation funetion, are less frequent, but are also 
indispensable for most leaming and recall tasks. 


Node Numbers and Complexity of 
Computing Systems 

Let us attempt to position neural computation in the more general context 
of computation. One of the most important features of artificial neural systems is 
that they perform a large number of numerical operations in parallel. These op- 
erations involve simple arithmetic operations as well as nonlinear mappings and 
computation of derivatives. Almost all data stored in the network are involved 
in recall computation at any given time. The distributed neural processing is 
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Node complexity 

Figure 9.1 Comparison of the number of nodes and single node complexity of computing 
hardware. [Adapted from Treleaven, Pacheco, and Vellasco (1989).] 


typically performed within the entire array composed of neurons and weights. 
This indicates that parallel distributed computing makes efficient use of the stored 
data (weights) and of the input data (inputs). 

In contrast to the neural processor arrays performing parallel computations, 
conventional computers operate on a very limited number of data at a time. The 
arithmetic-logic unit within the processor of a conventional computer manipulates 
only two words just recently fetched from memory. Although the same processors 
can perform a variety of complex instructions very quickly, many megabytes of 
data are idle during any instruction cycle. The duty cycle of a processor is close 
to 100%, but that of the stored data is close to zero in conventional computation. 
This explains the computational bottleneck when conventional computers must 
deal with very large input /output data rates and perform tasks such as com- 
puter vision or speech recognition. This also explains why neural networks can 
handle massive amounts of input / output data more efficiently than conventional 
computers. 

Figure 9.1 illustrates the comparison of conventional computers with the 
parallel distributed processing hardware. The parameters compared are the num- 
ber of addressable computing nodes versus their complexity. We adopt for this 
comparison a rather broad concept of the computing node. Any addressable unit 
ranging from RAM cell to processor is understood in this discussion to be a 
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computing node. The simplest processing node is the RAM cell, which typically 
consists of one and rarely more than three transistors. However, it has no pro- 
cessing power beyond its ability to retain the present binary content or to acquire 
a new one. In spite of their simplicity, millions of memory cells are needed for 
operation of a modem digital computer. We may place the central processing 
unit (CPU) of a computer at the other end of the Spectrum of nodes. The CPU 
is the most complex processing node; however, only a limited number of CPUs 
are needed for a computer to operate. 

Neural networks fail by nature of their parallel distributed processing into 
the middle ground between the two extreme examples of nodes of RAM cell and 
CPU. A medium number of processing nodes, each of low to medium complexity, 
perform local neurocomputing on local data. Ideally, each processing node, or 
neuron, should be rather simple and small to accommodate the placement of many 
of them on an integrated Circuit chip. Since the neural processing node does not 
need to perform dozens of diversified logic and arithmetic operations, it can be 
hard-wired to perform just its basic computational function (Mackie et al. 1988). 
Its operation is rather straightforward and can often be tailored to a particular 
application or network paradigm. 

Let us review the choice of an artificial neural system model, a choice that 
has to precede solution of a neurocomputing task. A specific neurocomputing 
task needs to be translated onto an artificial system model so that it can be 
implemented. Since the choice of an algorithm depends on the specific user’ s 
task to be accomplished, the user must first consider the selection of an optimal 
processing algorithm. The problem for which a suitable algorithm has been found 
is then transferred to the model to compute a solution. 

This sequence of steps is illustrated in Figure 9.2 showing the problem- 
algorithm-model flowchart. Altematively, the user’s problem can be transferred 
directly to the artificial neural system model for execution. This may be the case 
for well-defined and known tasks for which the choice of the successful model 
has been made, and parameters of the algorithm have already been developed 
based on the previous experience of the user. 


Neurocomputing Hardware Requirements 

Artificial neural system models contain both invariant and variable param- 
eters, which are called characteristics here and are essential for the system’s 
operation (Reece and Treleaven 1988). Invariant characteristics are fixed in the 
neurocomputing model, while variable characteristics can be modified. Invariant 
characteristics usually include, among others, network topology, number of bits, 
or simply accuracy in expressing the weight value, number of bits describing the 
activation and output values of the neuron, and interconnection density. Variable 
characteristics of the model typically include its learning parameters, sequence 
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Figure 9.2 Problem-algorithm-model flowchart. 


of leaming and recall operations, as well as the presence or lack of synchronous 
neuron updates, etc. 

Simulation of most neural network models in real-life applications can be- 
come computationally intensive to the degree that the processing bottleneck on 
conventional computers imposes constraints on practical explorations of large 
neural networks. In faet, limitations of computational power available has dwarfed 
the leaming machines’ research wave of a quarter century ago. 

As mentioned before, there are two distinet approaches currently being taken 
for supporting artificial neural system modeling: 

1 . General-purpose computers, which are programmable and therefore able to 
simulate a variety of models 

2. Special-purpose hardware, often dedicated to a specific neural processing 
model. 

Let us first outline the requirements for neurocomputing hardware. This outline 
will allow us to specify the set of features that neuroemulators should have. Un- 
derstandably, the features of any given artificial neural network depend primarily 
on its specific application. Let us look first at such properties as recall time, 
network size, and input layer size. Figure 9.3(a) tabulates five applications of 
layered feedforward networks. The applications listed differ widely from each 
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Recall 

Time 

Network Size 

Input 

Layer 

Application 

S 

Neurons 

Weights 

Size 

Radar Pulse 
Identification 

50 • 10“ 6 

384 

32 k 

128 

Robot Arm 
Controller 

0.01-1 

312 

3.6 k 

5 

Isolated Words 
Recognition 
(1024- word 
vocabulary) 

.01 

10 k 

2.5 k 

25 

Lo w Level Vision 
100 weights /neuron 

.04 

64 k 

6400 k 

64 k 

Financial Risk 
Evaluation 

1 

5-10 k 

4000 k 

? 


(a) 

Figure 9.3a Neural network comparison for typical applications: (a) typical sizes and recall 
times. [Adapted from DARPA (1988).] 


other in required processing speed and memory size. For example, for iden- 
tification of radar pulse signals, processing time is the most severe constraint 
but memory size is not. On the other end, financial risk evaluation modeling 
requires a very large, but not necessarily very fast, network. The response time 
of financial risk evaluation can be in the range of a few seconds, while the 
radar pulse Identification must be performed within 50 fis (Jutten, Guerin, and 
Herault 1990). Some other applications, such as vision, may require unusually 
large input /output bandwidth. The input /output bandwidth is understood here to 
be the number of bits per second that can be processed by the input /output unit. 
This brief comparison of computational requirements for various tasks indicates 
that a trade-off may be necessary between a network’s processing time, memory 
size, and the input /output bandwidth. When considering particular applications, 
these three features should be balanced and geared toward the specific problem 
requirements. 

Let us now consider the memory and network size of an artificial neural sys- 
tem. The size of a neural network is determined by the overall number of neurons 
and the total number of inputs, but mainly by the number of its weights. Thus, 
neural network size is about directly proportional to the number of weights. For 
example, there are (n — l) 2 weights in fully coupled single-layer networks with n 
neurons. Since the number of weights drastically exceeds the number of neurons, 
the size of the network is determined primarily by the number of weights. 

Assume that a neural network model is simulated on a digital computer. 
Each weight is typically characterized by a single value, and only the current 
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Application 

Memory 

Size 

Words 

Word 

Length 

Bits 

1/0 

Bandwidth 
W/s or b/s 

CPS 

Measure 

(Recall) 

MFLOPS 

Measure 

(Recall) 

Radar Pulse 
Identification 

< 32 k 

1/ neuron 
32 / synapse 

2.5 Mb /s 

640 M 

1.3 G 

Robot Arm 
Controller 

4.5 k 

32 

< 600 W/s 

< 350 k 

< 700 k 

Isolated Words 
Recognition 

300 k 

32 

2.5 kW/s 

24 M 

48 M 

(1024- word 
vocabulary) 

Low Level Vision 
100 weights /neuron 

6.4 M 

1 / neuron 
8 / synapse 

1.6 Mb/s 

156 M 

312 M 

Financial Risk 
Evaluation 

4.1 M 

? 

< 7 kW/s 

4 M 

8 M 


(b) 

Figure 9.3b Neural network comparison for typical applications (continued): (b) hardware 
requirements. [Adapted from DARPA (1988).] 


weight value needs to be stored during leaming. This also applies to error 
back-propagation training of multilayer feedforward networks. However, if ei- 
ther cumulative weight adjustment training, or training with a momentum term, 
is implemented, the most recent weight value must be memorized in addition 
to its present value. Furthermore, activation values, output values, and the error 
term value must be stored for each neuron of the layer that undergoes training. 

At the data format level, the machine representation must be more accurate 
for weights than for any of the remaining data. In particular, weight updates 
resulting during incremental supervised or unsupervised leaming are relatively 
small to provide training stability and convergence. A floating-point format for 
representing weights by number is therefore necessary. Floating-point represen- 
tation for weights typically requires a mantissa with 24 or 32 bits (Bessiere 
et al. 1991). 

The extemal environment in which the neural network is used usually im- 
poses the speed of data transfer. A high sampling rate, for example, would be 
necessary to process large video images. In addition to communication speed 
requirements, a neural network can be characterized by its computational band- 
width. The computational bandwidth for conventional computers is typically 
expressed in units such as MFLOPS (million of floating-point operations per 
second), or MIPS (million of instructions per second). For simulation of artificial 
neural systems the “CPS” (connections per second) unit has recently emerged as 
an efficiency measure (Bessiere et al. 1991). 
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The CPS unit has become the popular catch phrase for neurocomputing 
products. Its generality is severely limited, however. It describes only the pro- 
cessing speed for the recall mode of a multilayer feedforward network. CPS 
units neither reflect the speed of leaming of any architecture nor the processing 
speed within recurrent networks. Also, the number of CPS units does not include 
auxiliary system operations, which support inputs and outputs. 

Let us compare the units of CPS and FLOPS for feedforward networks sim- 
ulated on a conventional computer. The recall pass through a single neuron with 
n connections requires approximately n multiplication-accumulation operations, 
one for each weight. Assuming that multiplication-accumulation is equivalent to 
approximately two floating-point operations of the CPU, we can establish that 
there is 1 : 2 ratio between CPS measure and FLOPS measure. 

Figure 9.3(b) outlines the hardware requirements for the five artificial neural 
systems discussed earlier in Figure 9.3(a). The table lists the required CPS and 
FLOPS measures expected for each of the five applications. It can be seen from 
the table that each of the networks not only requires a different size of the 
memory measured in terms of memory words, but memory word lengths also 
differ. In addition, I/O bandwidth requirements differ widely among networks 
for different applications. 

As an example, the application of radar pulse identification requires 2.5 
Mb /s I/O bandwidth, which corresponds to the transfer of a 128-tuple binary 
vector every 50 /jls. The network performing this task is of rather moderate 
size starting at 32 000 weights and having 256 binary neurons arranged in two 
layers plus one layer of input nodes. If a binary input vector must be processed 
every 50 /z,s, then 32 000 connections must be implemented 20,000 times per 
second. We would thus have the required CPS measure of approximately 640 
MCPS. This corresponds to a computational power of about 1.3 MFLOPS of the 
computer simulating efficiently the recall stream of radar pulses that needs to 
be recognized. 

Let us attempt to outline the general computational requirement estimates 
for two important cases of neural processing. Our concem will be digital com- 
puter simulation of leaming in feedforward networks of multilayer perceptrons, 
and simulation of recall in discrete-time, fully coupled recurrent networks. Fig- 
ure 9.4 lists the average number of multiplication-accumulation operations per 
weight. Single-layer and multilayer feedforward network leaming, and single- 
layer and recurrent network recall are included in this comparison. The number 
of operations listed in the figure are understood to be estimates and are speci- 
fied as range of values. This allows accounting for supporting operations other 
than multiplication and addition. Also, the precise average number of operations 
would depend on the complexity of the activation function computation. The 
function can be defined for computation as a permanent look-up table, generated 
by polynomial expansion or calculated directly. Each of the methods requires a 
different computational expense. 
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Processing 

Mode 

Single-layer 

Network 

Multilayer 

Network 

Recurrent 

Network 

(Discrete-time) 

Leaming 

4-10 

7-12 

Batch 

Recall 

2-3 

2-3 

(2 - 3) X R 

Total 

6-13 

9-15 



ft— number of recurrent updates of all neurons until stability. 


Figure 9.4 Average estimated number of multiplication-accumulation operations per weight for 
three selected architectures. [Adapted from Jutten, Guerin, and Herault (1990).] 


It can be seen that the leaming of a multilayer feedforward architecture 
is the most computationally intensive neural processing form. Let us use the 
data of Figures 9.3 and 9.4 to estimate the required computational power of a 
conventional computer performing efficient leaming of isolated word recognition 
using a 1024-word vocabulary. Assume that the training needs to take 4000 
full presentation cycles. Thus, about 4 M input vectors are submitted to the 
network during leaming, or 4 M incremental leaming steps are needed. Since 
the network contains 2500 weights, assuming 12 operations per single weight 
adjustment would yield the total of 120 G required operations. It then takes 
about 42 min of computing time for a 48-MFLOPS computer to complete the 
training, or only 1 min for a 2000-MFLOPS computer to accomplish the same 
task. 

The calculation shows that the powerful computer can produce the multilayer 
feedforward network in question within a reasonable training time, provided the 
training converges to an acceptable solution. Improvements of training efficiency 
can be achieved in a number of forms. Some of them are: 

■ Dedicated hardware architectures of conventional programmable 
computers 

■ Parallel distributed architectures of densely interconnected computing 
neural nodes containing conventional multiply or multiply-add processors 

■ Reduction of the volume of the training data through the use of the 
generalization properties of networks 

■ Designing neurocomputing integrated circuits using digital, analog, 
or digital /analog array s performing simultaneous local computations 
within the entire array. 

As stated before, the availability of neurocomputing integrated circuits offers 
the most advantageous alternative for multifaceted utilization of neural networks. 
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Not only would the processing time drastically decrease for integrated neurocom- 
puting circuits, but also the size, power supply requirements, and the retraining 
possibility of networks would render miniature neurocomputers very attractive. 


Digital and Analog Electronic 
Neurocomputing Circuits 

To review special-purpose neurocomputing hardware, the exposition of artifi- 
cial neural systems must descend to the level of the neuron. The basic processing 
unit defined as the processing node (neuron) and weights (synaptic connections) 
was shown earlier in Figure 2.4. The processing unit needs to memorize data 
stored as connection weights, to generate a neuron’ s activation value, and to 
compute the unit’s output signal. These features would make it possible for the 
unit to perform recall functions. In addition, the ability to store the leaming rule 
and to adapt weights according to the rule would be the most desirable property 
of the processing node. The processed signals and functions of the node can be 
either analog, digital, or of a combined type (Goser et al. 1989). 

Let us review some basic issues of budding a neural processing node. To 
calculate the activation value using the conventional digital computer simulation 
method, the weight values stored in a local memory cell are first multiplied by 
the input values and then the products are added. More efficiently, the activation 
value can be compounded through serial calculation as follows 

net { 0 (9.1) 

neti <— net i + w^Xj, for j = 1 TO n STEP 1 

This commonly used multiply-accumulate sequence (9.1) requires synchronous 
operation of related digital hardware. High-precision calculations and consider- 
able noise immunity are inherent in this computing method. While high-precision 
computation of a neuron’ s response as in (9.1) is often indispensable for the 
leaming mode, the recall function usually does not require the same degree 
of accuracy. Furthermore, it is difficult to avoid the view that the computa- 
tion algorithm (9.1) cannot be substantiated by any biological analogies. Let 
us now therefore look at analog computation performed by a neural processing 
node. Analog computation is not only more plausible in biological neural net- 
works, but constitutes the fundamental form of processing for nonprogrammable 
neural hardware. 

In the case of an analog processing node, knowledge is encapsulated in ana- 
log weights Wy. Leaming of the network proceeds via relatively slow changes of 
weights w t j and only in the training mode. Recall proceeds as simultaneous analog 
computing within the entire array encompassing all nodes. In a simple version, 
weights can be determined by resistance values, and the analog computation of 
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(0 


Figure 9.5 Neuron and weight implementalion using resistors and an operational amplifier: 

(a) Circuit diagram, (b) transfer characteristics of an operational amplifier (open-loop), and 
(c) open-loop operational amplifier symbol. 

the scalar product and subsequent nonlinear mapping can be performed by a 
summing amplifier with saturation. 

A conventional operational amplifier can implement these node functions as 
follows 

/(x)=/|x«'rt| (9.2) 

A sample analog implementation of the processing node performing Equation 
(9.2) is shown in Figure 9.5(a). The neuron, or cell body, is formed from an 
infinity-gain operational amplifier with strong negative feedback. Weight value 
Wj is proportional to the conductance Gj. As derived below, the feedback resis- 
tance R f provides here a proportionality factor common for all weights. Let us 
analyze the computing node circuit and evaluate its weights in terms of circuit 
components. 
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Figure 9.5(b) illustrates the voltage transfer characteristics f(v+ - v_) of the 
open-loop high-gain operational amplilier. Figure 9.5(c) shows the Circuit symbol 
of the open-loop operational amplifier for which the characteristics shown are 
valid. In the linear region, the open-loop operational amplifier performs as a 
constant-gain amplifier of high gain. The open-loop gain value of the operational 
amplifier in the linear region corresponds to the slope of the steep part of the 
characteristics displayed in Figure 9.5(b). The output voltage / of the operational 
amplifier, however, can neither exceed the / sat+ value, nor can it drop below the 
/ sat _ value. This is due to the saturation of transistors within the device. Typically, 
saturation voltages / sat _ and / sat+ are of equal magnitude and of opposite signs. 
The values of / sat+ and / sat _ are about 1 V below the operational amplifier supply 
voltage Vdd< and 1 V above v ss , respectively. 

As a consequence, an operational amplifier operates within a very narrow 
range of input voltages. This range is (v sat _, v sat +). However, the differential input 
voltage v+ — v_ remains in this range only if the operational amplifier is in its 
linear region of operation. Note that only the operational amplifier with a closed 
negative feedback loop remains in the linear region. The negative feedback loop 
is closed when the resis tance R F connects the output with an inverting input of the 
operational amplifier. The neuron’s Circuit from Figure 9.5(a) can be considered 
as having the discussed properties. 

It can be seen that any rise in output voltage / causes an increase of current 
i F , which in tum elevates the potential v_. Such an increase is caused as the 
result of an additional voltage drop across the device input resistance R ( . This 
resistance can be measured between the noninverting and inverting input termi- 
nals and is typically of large value. The increase of voltage across R h however, 
results in a decrease in the output voltage since an increase in v_ brings about a 
decrease in the voltage /. The final outcome of this process is that the negative 
feedback around the operational amplifier enforces its stabilization in the linear 
region. More precisely, the existing negative feedback attempts to stabilize its 
differential input voltage v+ — v_ near 0 V. 

Inspection of Figure 9.5(a) indicates that the negative feedback enforces the 
condition v+ — v_ = 0 V. Since the noninverting input of the operational amplifier 
is grounded and thus v+ = 0 V, potential v_ remains very close to 0 V. This 
phenomenon is called Virtual ground and it makes the analysis of operational 
amplifier circuits rather simple. The advantage of the virtual ground node is that 
while being separated from the ground by a very large resistance R i9 the node 
behaves as grounded. The property holds for the operational amplifier inverting 
input node of Figure 9.5(a). This remains true as long as the negative feedback 
prevails. 

The KCL equation for the inverting input node of the Circuit from Figure 
9.5(a) can be obtained as 


h + h + • - • + i n = ~h 


(9.3a) 
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which is equivalent to 

(*i - v__)G x + (x 2 - v_)G 2 + V- • + (x n - V-)G n = (v_ 
Using the Virtual ground propérty we obtain 

x x G x + x 2 G 2 + . . . + x n G n 


/) 


Rf 


Rf 


(9.3b) 


The output voltage of the neuron can now be obtained from (9.3b) for the linear 
range of operation as 


where 


/OO = I Xj(-R f Gj) 
j~ X 

X xÅ-R F Gj) = net 
j= i 


(9.4a) 


(9.4b) 


Comparison of the coefficients in expressions (9.2) and (9.4) leads to the con- 
clusion that the weights implemented by the circuit of Figure 9.5(a) are 


Wj — RpGj 


(9.5a) 


Furthermore, considering the definition of activation net as in (9.4b), the neuron 
implements here a unity-gain activation function in the linear region. In faet, the 
amplifier produces a linear combination of input voltages with weights (9.5a). 
The amplifier saturates whenever/(x) reaches and attempts to exceed / sat +. It 
also saturates when/(x) attempts to drop below / sat _. This allows for expression 
of the activation function in the following form: 


f(net) 


/sat- ’ 

net , 

fsSLt+ * 


for net </ sat _ 

/sat— ^ ft^t ^ / sa t-f 

for net >/ sat+ 


(9.5b) 


The activation value net is not measurable or physically present anywhere in 
the discussed neural computing node circuitry. It is helpful, however, to define 
and use this variable as a hypothetical input voltage of a unity-gain saturating 
amplifier. 

Note that the discussed processing node configuration is somewhat prelimi- 
nary. The unit basically implements the identity activation function. Also, positive 
weights cannot be achieved in this circuit without some additional modifications. 
However, it is possible to produce a nonzero bias level in this configuration. This 
can be easily accomplished by connecting the noninverting input terminal of the 
operational amplifier to the bias voltage T. As a consequence, the Virtual ground 
potential of the inverting operational amplifier input is shifted to the value of T. 
Altematively, the fixed bias can be produced by fixing at a constant value one 
of the input voltages x x , x 2 , ..., x p in the circuit of Figure 9.5(a). 
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diagram and (b) computing node symbol. 


Figure 9.6(a) shows the resistor-operational amplifier implementation of an 
electronic neuron with the output voltage being weighted input voltages. Both 
negative and positive signs of weighting coefficients can be produced in this 
Circuit. Negative weights are formed in the same way as in the configuration of 
Figure 9.5(a). In addition, positive weights are produced by a voltage divider 
circuitry connected to the noninverting input of the operational amplifier. We 
assume here that negative weights w t are associated with inputs x t , for i — 1,2, 
. . . , «, and positive weights v z are associated with inputs y h for / — 1,2,..., p. 

Let us perform the analysis of this network by using the superposition prin- 
ciple. Our objective is to express the neuron’ s weight values and its range of 
linear operation in terms of Circuit component values. We assume momentarily 
that the neuron made of an operational amplifier operates in the linear region. 
According to the superposition technique, /(x, y) can be computed as the sum 
of individual n + p responses, which are the results of each separate excitation. 
The resulting neuron’s weights that need to be computed are shown in symbolic 
form in Figure 9.6(b). Markedly, results obtained in (9.4a) and (9.5a) are already 
available for negative weights. These weights are responsible for computing the 
contribution of inputs x x , x 2 , . . . , x n to the total neuron’s response. 

Let us note that the response due to the individual input y k when all re- 
maining inputs are short-circuited to ground can be computed using the following 
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expansion of the output voltage f(y k ) 



(9.6) 


The first ratio of (9.6) can be computed by setting v + = v_ due to the virtual 
ground property, and by writing the KCL equation for the node labeled A: 


(. f{y k ) - v+) G F -v + iGj = 0 (9.7a) 

;= i 


We thus obtain 

1 + R F £ G/ (9.7b) 

y=i 

easily computed from the voltage division 

for k = 1, 2, . . . , p (9.8) 

Combining expression (9.7b) and (9.8) yields the neuron ’s response f(y k ) due to 
y k from (9.6) as 

/(^a) = ^ ( 1 + £ (9-9) 

\ j =1 / ^j=o8j 

Summarizing the results of the analysis for this elementary neural processing 
node, the final negative and positive weight values expressed in terms of Circuit 
components result, respectively, as 


/(?*) _ 


The second ratio of (9.6) can be 
at node B as follows: 

v+ _ g k 


y k 2 jLo*/ 


w, = -R F Gi, for i = 1, 2, . . . , n 

v*= \1+R f i G,) =&—, fork= 1, 2, .... p 

V j~ l / ^j=o8j 

The total output voltage of the analyzed neuron is expressed as 

/sat- » ^/sat— 

/(x,y) = l net, for/ sat _ < net </ sat+ 

, /sat+ > for net >/ sat+ 

where 

net = w'x + v'y 


(9.10) 


(9.11) 


Noticeably, the assumption of linear operation remains valid only when the con- 
dition/ sat _ < net < / sat+ holds. For a net value outside this range, the performed 
analysis of the neuron of Figure 9.6 remains invalid and the neuron’s output 
latches at one of its saturation voltage levels / sat _ or/ sat+ . 
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EXAMPLE 9.1 


This example illustrates the typical design steps of an electronic neural 
network for cluster detection. Assume that a three-neuron network has been 
trained in the winner-take-all mode as described in Section 7.2. The network 
needs to be able to identify one of the three input vector clusters. Network 
weight vectors that need to be implemented are 



'-2.897' 


'2.121' 


'-1.026' 

W4 = 

0.776 

• = 

2.121 

> w c = 

-2.819 


A network with three operational amplifiers, each using four resistors, as 
shown in Figure 9.7, can be employed to fulfill the design specifications 
(9.12). Detectors of clusters 2 and 3 require two positive and two negative 
weights, respectively, while cluster 1 needs weights of both signs. Let us 
observe that each of the three computing neurons has only two associated 
weights and four or five resistors. Since we have some freedom of choice 
here, all but two weight-determining resistor values will thus need to be as- 
sumed. Design assumptions for further considerations are R f = r 0 = 1 kft. 


Cluster 1 Neuron: Using Equations (9.10) for n = p = 1, we obtain from 
(9.12) 

( -R f G x = -2.897 

{ (1 + R f Gi ) gl = 0.776 < 9 - 13a) 

l 80+81 


Letting R F = r 0 
form 


l/g 0 = 1 kf2 simplifies the expressions (9.13a) to the 


G x - 10 3 = 2.897 

3.897— = 0.776 

lO- 3 + gl 


(9.13b) 


The reader can verify that R { = 1 /G, = 345.2 ft, and r { = 1 /g, = 4022 f l 
are the solutions of (9.13b). This completes the design of the neuron that 
detects cluster 1. 


Cluster 2 Neuron: Using Equations (9.10) for p = 2, we obtain from 
(9.12) 


< 


(1 +R F G l ) 
(1 + RfGO 


8 1 

So + gi + 82 

82 

80+81+ 82 


2.121 

2.121 


(9.14) 


Note that a dummy G, conductance between the inverting node of the op- 
erational amplifier and ground is used here to provide the fixed gain f/v + 
of the Circuit. This gain is now 1 + R F Gy. Since the ratios on the left side 
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R„= lkn 



Figure 9.7 Cluster detecting network with the three resistor-operational amplifiers neurons of 
Example 9.1. 


of (9.14) are smaller than unity, 1 + RpG^ should be chosen in the order of 
tens so that the positive weights can be in the required range. Since R f = 
1 kft, then for 1 +R F G 1 = 10 we obtain /?, = l/G 1 = 111.1 Cl, and (9.14) 
simplifies as follows 

' ^ = 0.2121 

, 8o + 8i + 82 

^ = 0.2121 

So + Si + 82 


(9.15) 
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Since we assume r 0 = l/go = 1 kfi, the solutions of (9.15) are obtained 
as folio ws: 

r { = r 2 = — = — = 2.714 kil 
*1 #2 

This completes the design of a neuron that detects cluster 2. 


Cluster 3 Neuron: Using the first of two equations (9.10) for case n = 2 
we have 


( 9 . 16 ) 


f R F G X = 1.026 
|^G 2 = 2.819 

Again using = 1 kft, the remaining resistances of the network can be 
computed from (9.16) as R x = 1/Gj = 974.6 fl, and R 2 = 1 /G 2 = 
354.7 ft. This completes the design of the entire cluster detecting net- 
work. M 


The techniques described in this subsection apply equally both to neuron 
circuits with zero and nonzero bias levels. Assume that a fixed bias of nonzero 
value is needed for a neuron. This can be achieved by augmenting the input 
vector x by a single fixed component x n+l . This component is used to produce 
the bias signal and can be called the reference voltage. Stage 1 of the circuit 
from Figure 9.8(a) depicts an example neuron with the weights of values —w i9 
for i — 1, 2, . . . , n, and the nonzero bias produced by the input x n+1 . Applying 
relationships (9.10) and (9.11) allows for computation of the output voltage of 
the first stage as 

n 

foM, ■ ■ ■ ,X n+1 ) = I (-H',)x i - x„ +1 (9.17) 

1=1 

As for previous configurations, the stage from Figure 9.8(a) also works with 
the linear transfer characteristics of unity slope, provided / sat _ < fol </sat+- 
Otherwise, the stage saturates at / sat _ or/ sat + for the output voltages that attempt 
to exceed the linear range of an operational amplifier. 

In cases for which a high-gain neuron is needed, the unity-gain neurons, 
examples of which are shown in Figures 9.5, 9.6, and 9.7, should be connected 
in cascade with a high-gain amplifier. Stage 2 of the circuit from Figure 9.8(a) 
provides the gain of value — K, but it also reverses the polarity of all weights. The 
overall transfer characteristics, or activation function of this two-stage neuron, is 
given by the following formula: 

f o2 {\) = K ( i w t Xi + x„ +1 ^ (9.18) 

Again, whenever the value / o2 (x) computed from (9.18) falis outside the range 
(/sat-./sat+)> 1116 neuron’s output saturates at either/ sat _, or/ sat+ . The resulting 
transfer characteristics of the two-stage neuron is shown in Figure 9.8b. It can 
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R F lw { 



Figure 9.8 Neuron of gain K with positive weights and nonzero bias value: (a) Circuit diagram 
and (b) activation function versus X/= i ■*}*/■ 


be seen that the neuron’s bias, or threshold, value is T and has now an opposite 
sign with respect to the reference voltage x n+l , since we have T = ~x n + x . 

In this subsection we introduced the simple electronic model of a processing 
neural node. It is a node that sums a number of weighted input voltages and 
produces an output voltage that is a nonlinear function of that sum. The nonlin- 
ear function of a processing node plays the same role as an activation function 
of a neuron. The activation functions implemented thus far with resistors and 
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operational amplifiers have been piecewise linear with controlled slope near the 
origin, and having symmetrical saturation for large activation levels. 

The resistor-operational amplifier circuits introduced in this section rather 
lucidly illustrate the basic design concepts of the neural network computing 
nodes. Nevertheless, they are of limited practical significance due to several 
important reasons. Some of them are limitations of the shape of the produced 
transfer characteristics, excessive power consumption, and large area occupied 
by discrete resistors and operational amplifiers. In addition, the main problem 
with the electronic neural networks just discussed is that weight values are en- 
capsulated in the values of resistances of fixed, discrete resistors. Such resistors 
are not easily adjustable or controllable. As a consequence, they can be used 
neither for leaming, nor can they be reused for recall when another task needs 
to be solved. 

As will be shown in the subsequent sections, the discussed circuits from 
Figures 9.5 through 9.8 play an important role in more advanced neural network 
implementations. More sophisticated, integrated Circuit neural networks are of- 
ten based on the simple principles just illustrated with resistive networks and 
operational amplifiers. In faet, conversion of input voltages into weighted cur- 
rents highlighted in (9.2) is one of the most frequent intermediate operations 
encountered in electronic neural network implementations. Another commonly 
used technique is the summation of currents into a proportional voltage response 
with nonlinear saturation. More specialized design techniques for electronic neu- 
ral processing nodes are discussed below in more detail. 


9.2 — 

INTEGRATED CIRCUIT 
SYNAPTIC CONNECTIONS 

Artificial neural networks usually contain a very large number of synaptic 
connections and mueh fewer processing neurons. As mentioned before, con- 
ventional resistors can be used to produce synaptic connections in the form of 
weights. Nonlinear voltage amplifiers can be used to accomplish weighted sum- 
mation of input voltages into the activation value, and also produce nonlinear 
mapping of an activation into the output voltage. 

The central problem of implementation of artificial neural networks is to 
make weights that are continuously adjustable, preferably in response to an ana- 
log control signal. In addition, weights should not require many transistors for 
their implementation and for control of their values. Ideally, weights should also 
be capable of leaming under a specified leaming rule. We should realize from 
previous study that the requirement of adjustable weights is easier to implement 
than this one of leaming weights. Indeed, a leaming weight must not only be 
adjustable but it also must respond to a leaming signal according to the training 
rule it implements. 
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Rf 



Figure 9.9 Functional blocks of the basic Circuit of a neuron with fixed weights: (a) resistive 
synaptic connections and (b) current-to-voitage converter. 


Adjustable weights are weights that can be modified off-line as a result of 
the leaming process. Learning weights are understood as weights that modify by 
themselves on-line during leaming. The leaming signal in such cases is provided 
to the weight circuitry that allows the weight modification. The leaming weight 
modification can be an outcome of either the current continuous or step-like 
leaming. Usually, weight modification depends on the neuron’s output signal, 
but it may also involve other quantities. 

Neurobiology solves an on-line leaming problem by using Chemical modi- 
fications of ionic conductances, and by constructing a synapse that is a complex 
signal transmitting device with history-dependent properties. Making artificial 
neural network connections at least partially as elever as the natural connec- 
tions could have great impact on the effeetiveness of analog implementations 
of neural networks (Hopfield 1990). Let us focus, however, on studying mi- 
croelectronic techniques that are presently available rather than on imitating 
neurobiological processes that are too involved to be duplicated in present-day 
integrated circuits. 

As discussed in the previous section, inputs to the electronic model of a 
neural network are assumed to be voltages x for i = 1, 2, . . . , n. To produce a 
weighted sum of inputs, the input voltages are first converted into branch currents 
xfii as shown in Figure 9.9(a). The values G, are branch conductances of resistors 
connected between the fth input and ground. The branch currents must then be 
summed and the sum converted back to voltage using, for example, a current- 
to-voltage converter depicted in Figure 9.9(b). The converter output voltage v 0 
is equal 

v 0 = -i 0 R F (9.19) 

To implement a neural processing node following the description above, two 
functional blocks from Figure 9.9(a) and (b) must be connected in Cascade. Node 
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A needs to be connected to Virtual ground node B instead of to the actual ground 
node as in Figure 9.9(a). This connection enforces i = i Q9 and the output voltage 
value becomes as expressed by (9.3) and (9.4). The two connected networks 
yield a basic neural processing node identical to the one introduced earlier in 
Figure 9.5(a). 

In this section we will study analysis and design of adjustable microelec- 
tronic synaptic connec tions. Our focus is on voltage-controlled resis tors that 
implement such connections. We will see in the following discussion that both 
analog and digital control can be used to produce variations of weight values. 
The performance of leaming synapses that are trained according to the Hebbian 
rule will also be studied. 


Voltage-controlled Weights 

The resistive synaptic connection Circuit with electrically tunable weights 
can be developed using the basic electronic neuron concept of Figure 9.9. As- 
sume that there is a voltage-controlled switch available with performance as 
depicted in Figure 9.10. Figure 9.10(a) shows a symbol of the switch, and the 
typical characteristics of a switch are display ed in Figure 9.10(b). The switch 
can operate between an on State (short-circuit), V AB = 0, and an off State (open- 
circuit), Iab ~ 0. If neither of the two States applies, the switch assumes the 
resistance between terminals A and B, equal to R AB . The value of can be 
made dependent on the controlling voltage V c . It can be seen from Figure 9.10(b) 
that R ab decreases with the increase of the control voltage V c . 



Figure 9.10 Voltage-controlled switch: (a) symbol and (b) graphical characteristics. 
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Figure 9.11 Scalar product Circuit with electrically tunable weights. 


The voltage-controlled switch can be implemented using a single field-effect 
or MOS transistor operating in the resistive (ohmic, also called linear) region. Fig- 
ure 9.11 shows a neuron with n synaptic connections, each of them consisting of 
a single enhancement mode NMOS transistor performing as a voltage-controlled 
switch. With reference to Figures 9.10 and 9.11, terminals A or B of the switch 
can be either the drain or source of the transistor since the device is electrically 
symmetric. Node B is the source for positive input voltage, otherwise node A 
becomes the source, so that the channel current always flows toward the source 
terminal. For this discussion, we assume that the source is at the Virtual ground 
potential and input voltages are positive. The input voltage is thus applied at 
the drain terminal of an enhancement mode NMOS transistor. The channel re- 
sistance R ds , or the resistance between the drain and source of the fth transistor, 
is controlled by the gate to source voltage V gs . 

Let us express the transistor channel resistance R ds as a function of control- 
ling voltage V gs . The drain to source current, l ds , for the transistor depicted in 
Figure 9.12(a) is given by the following expression (see Appendix) 





(9.20) 


for V ds < V gs — V th (resistive region), where k f is the process transconductance 
(typically of value 20 ^tA/ V 2 ), W and L are channel width and length, respec- 
tively, and V th is the transistor threshold voltage (typically between 1 and 2.5 V 
for enhancement mode NMOS devices). 

Figure 9.12(b) shows the output characteristics of an NMOS device oper- 
ating in the resistive region. The characteristics depict I ds (V ds ) with the gate to 
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Figure 9.12 Enhancement mode NMOS transistor as a voltage-controlled switch for weight 
tuning: (a) device symbol, (b) output characteristics in the resistive region, and (c) channel resis- 
tance versus controlling voltage. 


source voltage ^ as a parameter for each curve. Comparing the switch char- 
acteristics of Figure 9.10(b) with the transistor characteristics of Figure 9.12(b), 
the reader may notice that V COFF = V th . Voltage V CON is not defined for the 
NMOS transistor configured as a switch, since R ds is always of nonzero value. 
Noticeably, the first quadrant characteristics of the switch also extend into the 
third quadrant, which makes it possible to apply not only positive but also neg- 
ative input voltages to the synaptic connection. If Kt, <C V gs - V lh , then the 
quadratic-type relation (9.20) becomes approximately linear in and we have 

W 

Ids “ k' T <y gs - V th )V ds (9.21a) 


This linear approximation of (9.20) as in (9.21a) can be considered accurate for 
practical purposes for V ds < 0.5(V^ - V th ). For this condition to be fulfilled, the 
voltage-controlled resistance value R ds equal to the ratio of V ds to I ds becomes 
from (9.21a) 


Rds = 


L 

k'W{V gs - V lh ) 


(9.21b) 


It can be seen that R ds is controllable in a very wide range and V gs can be used 
as the control voltage. As illustrated in Figure 9.12(c), the switch remains open 
for V gs < V th , and then with V gs increasing above V lh , the R ds value drops quickly. 
An additional advantage of this simple configuration is that the value of R ds can 
be scaled by the choice of the channel length to width ratio, L/W. Obviously, for 
large L/W values, the channel resistance is limited to large values. This becomes 
the case for so-called “long channel” or “long transistors.” 
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When applying this controlled switch as a tunable resistance, keep in mind 
that, ideally, the linear resistances of values as in (9.21b) are only obtainable 
at the origin I ds = V ds = 0. Otherwise, slight nonlinearity due to the quadratic 
term in (9.20) is always present, and the weights can only be approximately 
considered as linear voltage-controlled resistances. This should not be a serious 
problem, however, since no great precision of weight values is usually required 
for a neural network operating in the recall mode. Also, only positive resistances 
can be produced by the discussed switch and thus only weights of one sign can 
be implemented by the Circuit configuration of Figure 9.11. 

One possible way of producing both positive and negative weights is to 
use an additional network of positive synaptic weights in the form of a voltage 
divider at the noninverting input of an operational amplifier. This concept was 
illustrated earlier in Figure 9.6 for discrete resistor circuitry. Another possible 
solution is to use complementary output operational amplifiers, which would 
provide both positive and negative weight values by producing output voltages 
of both signs. 

At this point we have gained insight into the analysis and design of neurons’ 
weights, which can be adapted electrically by means of a control voltage. In 
addition to this very desirable feature of tunability, however, voltage-controlled 
weight values should be stored to ensure proper synapse operation over a period 
of time. This issue will be covered following the example below which focuses 
on analysis of a neuron Circuit with electrically controlled weights. 


EXAMPLE 9.2 


Let us analyze the neuron Circuit illustrated in Figure 9.11 with three 
weights. Following are the data on transistors, control voltages, and the 
Circuit elements used to design this circuit. 

Transistors: k' — 20 fiA / V 2 (for all transistors) 


V th = 1 V (for all transistors) 



Control Voltages : V gsl = 5V,^ 2 = 3V, V g # = 3.75 V 

Circuit: R F = 100 kfl 

Our objective is to find the neuron ’s output voltage v 0 as a function of x 
where x = [x x x 2 x 3 ] / . 

Let us initially assume that the drain-source resistances of NMOS tran- 
sistors are exactly linear because V dsi <C V gsi — V th , for i = 1, 2, 3. This 
would require that a small range of input voltages be maintained. With 
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this assumption, the transistors are in resistive regions and their channel 
resistances R dsi , for i = 1, 2, 3, can be determined from (9.21b) as follows 


<LIW)i 


R dsi 20 • 10 _ 6 (V^j, — 1 )’ 


for i = 1, 2, 3 


(9.22a) 


Performing calculations as in (9.22a) for the specified network parameters 
yields the following channel resistance values 


R ds i = 12.5 kft 
Rds2 = 8.33 kO 
R ds 3 = 6.06 kO 


Considering that the feedback resistance R F is 100 kO and that in the linear 
range of operation we have 


v. = -R P 


x i \ X 2 | *3 


\Rds[ Rdsl Rdsi 
then the resulting weight values are as follows: 


(9.22b) 


Wj = —8, >v 2 = —12, vv 3 = —16.5 

Let us now verify the accuracy of the computed weights and determine a 
reasonable operating voltage range for the Circuit. Limiting the range of 
input voltages so that Vdsi s 0 .5(V^ — V th ), the maximum values of input 
voltages for linear approximation of R ds as in (9.21b) become equal: 

Vdsi = \{V g si - 1) (9.22c) 

Given the specified control voltages, V gsi , the values of maximum input 
voltage, V dsi , are 2, 1, and 1.375 V for each transistor, respectively. It is in- 
teresting to compute the actual values of implemented weights considering 
the condition that the input voltage has reached the maximum level beyond 
which the channel resistance cannot be considered linear. 

For such a case, R ds cannot be computed from a closed-form expression 
(9.21b) since its value also depends on the drain current. Using expression 
(9.20) we thus have for the transistor Ml 

I*i = 20 • 10 -6 (1) (4 • 2 — f \ A 

which yields 


hs\ = 120 /xA 


For the maximum input voltage level of value V dsl = 2 V, the resulting 
actual channel resistance is R dsl 

R' dsl = 16.6 kil 
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It can be seen that the discrepancy between the newly computed accurate 
value of the channel resistance R dst and the original approximate value of 
R dsl = 12.5 kfi is 33.3%. This indicates that for the working region of 
input voltages Km < 2 V, the actual channel resistance is between 12.5 and 
16.6 k O, with the resistance error ranging between 0 and 33.3%. In faet, 
the precise value of the resistance depends on the input voltage value. 

The reader can easily verify that similar analysis results in the following 
actual resistance values in the bordercase of input voltages: 

R* 2 = 1U kH for V ds2 = 1 V 
R ’ ds 3 = 8.08 k ft for V ds3 = 1.375 V 

The actual weight values implemented by the Circuit for the maximum level 
of inputs are now resulting in fairly different values than originally specified 
for this neuron which are valid for very small input voltages. The output 
voltage would be now of the value 

V* - -Rp [ l66 kfl + YTTk a + 8.08 kfl/ 

and the weights implemented by this network are w[ = — 6, w 2 — —9, 
and w 3 = -12.37. This example demonstrates that in order to preserve the 
linearity of transistor operation and to produce constant values of weights, 
the input voltages should be kept well below the upper limit of channel resis- 
tance linearity, which has been set at x imSLX = 0.5(V^— V th ), for i = 1, 2, 3. ■■ 


Analog Storage of Adjustable Weights 

Storage of adjustable analog weights is one of the most important problems 
faced in analog implementation of artificial neural systems. The storage form can 
be analog; it would thus have the properties of an analog memory celL Storage 
cells should also be as small as possible since one memory cell would be required 
per every weight (Vittoz et al. 1991) and the overall size of a neural network is 
determined primarily by its number of weights. 

A commonly used storage technique is based on storing charge across a 
capacitive Circuit element. Note that voltage-controlled weights outlined in the 
previous section use the gate of an NMOS transistor as the weight control termi- 
nal. Since no current is needed to drive the transistors’ gates, the simple, natural 
method to store the weight value is to store it as a voltage across the gate 
capacitance (Tsividis and Anastassiou, 1987). The gate capacitance can be either 
intrinsic or intentionally added to the circuit to enhance its storage capability. 

Figure 9.13(a) illustrates the concept of storing a weight as a voltage across 
a capacitor C at the gate of transistor Ml. The weight control voltage v c that 
needs to be stored is sampled by means of a sample-and-hold circuit. Transistor 
M2 shown in Figure 9.13(b) funetions as a sample-and-hold gate and provides 
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OV (virtual ground) 
to neuron 



sample and hold 
Circuit 



(b) 


Figure 9.13 Elementary sample-and-hold Circuit for weight storage: (a) Circuit with sample-and- 
hold switch and (b) all transistor implementation. 


the on/off operation of the switch. The voltage V# directly Controls the channel 
resistance of transistor Ml. 

For an ideal operation of switch M2, exactly tracks the weight control 
voltage v c when the switch is at least partially on. This corresponds to M2 being 
conductive and V “ v e . To ensure the voltage tracking, the switch control 
voltage V g must be well above the level of the weight control voltage v c . This is 
due to the faet that the voltage actually controlling the gate of the switch, ^2. 
has to be in excess of V th2 . This condition ensures that the switch M2 remains 
conductive between the drain and source. By inspection of Figure 9.13(b) we 
obtain 

Vg = v gsl + V gs2 (9.23a) 

For both Ml and M2 conducting, the folio wing conditions must be met, respec- 
tively: 

> V M 
Vgs2 > V lh2 


(9.23b) 
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Assuming that threshold voltages of both transistors are identical and equal to 
V th , we obtain from (9.23a) 

V, > 2V lh (9.23C) 

This conclusion supports an intuitively obvious statement that both transistors 
must operate with their gate-to-source voltages above V*. In such a case, M2 
conducts the current (sampling mode) and adjusts the amount of charge that is 
to be stored in the capacitance C. The voltage across the capacitance, V gsl , in 
tum Controls the weight value. To ensure the analog storage of the weight value 
(hold mode), the open-circuit switch condition V gs2 < V t hi must subsequently be 
enforced to trap the charge at the capacitor. 

In the absence of a positive pulse at the gate of the switch, the capacitor 
C holds the charge, which has been stored during the most recent conductive 
(sampling) period of M2. When the switch is off and the drain and source of 
M2 are disconnected, only a small leakage current I L flows through a reverse- 
biased diffusion-to-bulk junction of M2 (these regions are not shown in Figure 
9.13(b)). As a result, the storage capacitor will slowly discharge. For today’s 
conventional MOS technologies, the retention, or storage, time is of the order 
of several milliseconds. To store the weight control voltage for longer periods, 
the charge stored at the capacitance C needs to be refreshed once every few 
milliseconds. This leakage and refresh phenomena here are analogous to the 
phenomena encountered in the dynamical memory RAM cell, which also needs 
to be refreshed periodically. 

For a number of weight values to be electrically tuned, one control voltage 
generator providing voltages v c can be time-shared by many neurons. As each 
switch closes in tum so that sampling can take place, v c assumes an appropri- 
ate value for charging the corresponding capacitor. Note that the capacitors of 
the various neurons and weights can be accessed using well-known addressing 
schemes from semiconductor memories. The basic difference is that the data 
stored now are in the form of analog control voltage rather than in the binary 
form (Tsividis and Anastassiou, 1987). 

Once the appropriate control voltages have been stored on the gate capac- 
itors, the network is ready to perform recall. Upon completion of recall, the 
storage capacitors can be accessed again, either to refresh the existing weight 
values or to impose new weight values that may result as an outcome of the 
incremental leaming step. 

The natural decay of a capacitor ’s charge that represents the leamed weight 
value is one of the inherent limitations of the analog storage of weights. Es- 
sentially, this is the implementation problem shared by all analog nonvolatile 
memories. Advanced circuit techniques exist that extend the storage time of ana- 
log microelectronic memory through either additional special circuitry or special 
refresh signals. The reader is referred to the technical references for more details 
on special refresh circuitry (Vittoz et al. 1991). Other, alternative techniques for 
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analog memory for weight storage make use of electrically programmable non- 
volatile memory cells. Electrically programmable memory cells will be examined 
in Section 9.4. 


Digitally Programmable Weights 

As stressed earlier in this chapter, one of the most desirable features of neu- 
ron ’s weights is their tunability. In addition to the tunability of weights with the 
use of an analog voltage control signal, weights implemented as integrated circuits 
can be controlled digitally. This technique for weight tuning allows for increased 
flexibility compared to the analog tuning mode due to the faet that weights can 
be both programmed and stored digitally. Weight control data, arranged in digital 
control words, are easily stored in digital registers. Understandably, the accuracy 
of the weight setting in this mode depends mainly on the length of the control 
word. This accuracy limitation is common for any digital-to-analog conversion 
task and is known as the finite resolution and conversion error. Moreover, the 
accuracy of the subsequent analog signal processing is also of importance. The 
key consideration here is that once the weight values have been set digitally, 
all subsequent network computation during the recall mode is performed with 
analog signals. This involves multiplication, addition, and nonlinear mapping by 
the neuron. 

In addition to flexibility and accuracy, the digital tunability of weights pro- 
vides a convenient interface between the training and recall mode of operation 
of the neural hardware. Since leaming often requires considerable accuracy, it is 
usually performed off-line and with the use of high-precision digital computa- 
tion. Weight values computed as a final result of simulated off-line training are 
subsequently programmed on the neural network chip using the digital-to-analog 
interface. In addition, digital weight storage in RAM cells or registers of a digital 
computer is easily available. 

A block diagram of a circuit producing a single digitally programmable 
weight and employing a multiplying digital-to-analog converter (MDAC) is 
shown in Figure 9.14. MDAC circuits are used in many electronic circuits that 
require the use of digitally tunable resistances and a flexible digital-to-analog 
interface (Zurada 1981). The converter circuit shown in the figure performs the 
decomposition of an input voltage x t into binary weighted currents i x , i 2 , . 
i n , which are proportional to jq. The binary weighted currents are then summed 
and converted back into the output vdltage in the current-to-voltage converter 
identical to the one depicted in Figure 9.9(b). 

Assume that register bits of value 0 or 1 represent the entries of the bi- 
nary control word b N b N _ x . ..fej, where the most-significant bit (MSB) is that 
corresponding to b N , while the least-significant bit (LSB) corresponds to b x . 
Although a digital word can be presented in a variety of codes, the one assumed 
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here for expressing the weight value is simply its pure binary representation. 
Figure 9.15(a) shows the Circuit diagram of a weighted-resistor MDAC (Sedra 
and Smith 1991). Successive resistances in the network are inversely proportional 
to the numerical significance of an appropriate bit. It can be seen that b f of value 
0 directs the weighted current to ground by positioning the respective switch S t 
in its left-hand position. Switches for which b { = 1 direct the binary weighted 
currents to the summing node B. Let us note that the currents can be considered 
as binary weighted since node B serves as a Virtual ground. The expression for 
current at the input of the summing current-to- voltage converter becomes 

I = h + å b »-' + --- + ¥k b ' (9 - 24a) 

This expression simplifies to 



(9.24b) 


Let us define the digital control word D as 

D åh + A_ + 

2 n 2 n ~ 1 

and rewrite (9.24) in a more concise form: 

'-£*> 


. + 


(9.25) 


Note that D can take values 0, 1/2^, \/2 N ~ l , 1 - 1/2 N . These values 

uniformly cover the range [0, 1 - 1 /2 N ] with the resolution 2~ N . The resolution, 
for instance, is about 0.1% for a 1 O-bit converter, but it is only 6.25% for a four- 
bit converter. Depending on the scaling required, different R F values can be used 
in the feedback loop of the current-to-voltage converter. Assuming R F = (1 /2)R, 
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Figure 9.15 Circuit diagram of a digitally programmable weight: (a) using a weighted-resistor 
MDAC and (b) using a ladder-resistor MDAC. 
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we obtain from (9.25) and (9.19) 

v* = -Dv å (9.26) 

The neuron’s output voltage (9.26) is the product of the digital control word D 
representing the weight value and of the analog input voltage. That is why this 
particular configuration can be referred to as a programmable attenuator. How- 
ever, by using appropriate scaling and selecting a suitable R f value, the range 
of weight values can be adjusted and increased above the unity value according 
to specific design requirements. The value of R F must be chosen in excess of 
R/ 2 if the weights of magnitudes larger than unity are required. Similar to the 
circuits described earlier in this chapter, however, a large R F causes saturation of 
v 0 at low levels of currents /. In such cases, the neuron’s response will approach 
the bipolar binary activation function and make it resemble the hard-limiting 
response of the threshold logic unit. 

Figure 9.15(b) shows an alternative converter circuit called a ladder-resistor 
MDAC (Sedra and Smith 1991). The network operates on the same principle 
of producing binary weighted currents. Instead of binary weighted resistances, 
however, now a ladder network of R — 2R resistors is used as a current splitting 
structure. We observe from the figure that at any of the ladder nodes, the resis- 
tance is 2 R looking to the right from each node numbered 1, 2, . . . , N. This is 
due to the faet that the horizontal resistance R of the ladder adds with a parallel 
combination of two resistances of value 2 i?, each being connected to ground. 
Thus, such a series connection of R , with 2 R parallel to 2/?, yields the resistance 
2 R. This particular property applies at each of the nodes 2, 3, . . . , N. 

As a result, the current entering each node from the left splits into two equal 
parts, one flowing down toward the switch, the other half toward the adjacent 
node to the right. Current is halved this way in every upper node of the ladder 
network yielding the total number of splits equal to N. However, only the cur- 
rents for which switches are in the right-hand positions contribute to the current 
I entering the node B of the current-to-voltage converter. The output current of 


the MDAC thus has the value 




II 

'b N b N _ l 
,2 4 

+ v\ 

2*7 

(9.27a) 

or 

>- x i° 


(9.27b) 


A formula identical to (9.26) for the output voltage in terms of input x i can be 
obtained if we let the feedback resistance R F of the operational amplifier be of 
value R. 

Clearly, each MDAC configured as discussed and shown in Figure 9.15 can 
implement a single digitally controlled weight. The total of n + p MDACs needs 
to be connected to a current-to-voltage converter arrangement to implement the 
set of n negative and p positive digitally controlled weights. Assume that the 
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(b) 


Figure 9.16 Digitally controlled neuron with programmable bipolar weights: (a) MDAC symbol 
and (b) Circuit diagram for n negative and p positive weights of a single neuron. 


MDAC is represented in the form of a block diagram with the signal input 
voltage at node A, converter output at node B , and the ground terminal at node 
C as shown in Figure 9.16(a). An example neuron’s Circuit diagram can take the 
form of Figure 9.16(b). The neuron shown in the figure implements n negative 
and p positive weights. Operational amplifiers ØA1 and OA2 produce output 
voltages as a function of both inputs and digital Control words. We thus have for 
the output voltages 


Vol = -<*\D X i + x 2 D x2 + . . . + x n D xn ) 


(9.28a) 
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and 


or, briefly 


v o2 - ~(y \ D y\ + yi D y2 + • • • + y p Dyp) 


v oi = - i x Pxi 


1=1 

p 


^ o2 Z y i^yi 

i=l 


(9.28b) 


Operational amplifier OA3 serves as a unity-gain inverter. Linear combination of 
the two inputs — v ol and v o2 is subsequently computed by the inverting amplifier 
of gain —K based on the operational amplifier OA4. We thus have for the output 
voltage 

v, = K(y ol - v o2 ) (9.29a) 

The substitution of (9.28) into (9.29a) yields the neuron’s output voltage deter- 
mined by n negative and p positive weights as folio ws: 

v* = - Z KD xi x t + £ KD iyi (9.29b) 

i=i i = i 

Again, this neuron implements a linear activation function of identity value with 
weights —KD xh for i = 1,2, and KD yi , for i = 1, 2, . . . , p. If, however, 

the output voltage v a computed from (9.29) falis outside the linear output range 
(/sat— >/sat+) °f an operational amplifier, then the neuron’s output saturates at one 
of the levels / sat _ or/ sat+ . 

Both weighted-resistor and ladder-resistor MDACs can be implemented in 
integrated Circuit technologies. Weighted-resistor MDACs require only as many 
individual resistors as bits in contrast to the ladder-resistor MDAC, which requires 
twice as many resistors. However, the spread of resistance values, or the ratio of 
the largest to the smallest resistance for weighted-resistor MDACs, is 2 N . Fab- 
rication of integrated Circuit resistors with extreme yet precise resistance values 
encounters technological difficulties. This limits the use of this type of MDAC 
to usually up to 5 or 6 bits. The ladder-type converter avoids this difficulty 
of extreme resistance values and is more suitable for high-accuracy resistance 
fabrication. 

Let us look at an example of a neural network implementation that directly 
utilizes the described concepts. A neural network integrated Circuit chip has been 
designed and fabricated using standard 3 pm CMOS technology MDAC’s (Raffel 
et al. 1989). The converters are 4-bit with the fifth bit controlling the sign of the 
implemented weight. Converters are addressed using a 1 O-bit address bus, thus 
allowing for handling of 1024 programmable weights. In addition, there are 32 
analog signal inputs to the chip. Up to 16 neurons can be configured using the 
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chip circuitry when extemal amplifiers serving as current-to-voltage converters 
and neurons are added. The chip outputs 16 excitatory and 16 inhibitory currents 
to off-chip current-to-voltage converters. The outputs of the two converters are 
then fed into a differential amplifier-neuron providing analog levels compatible 
with the input levels on the chip. This particular feature enables building layered 
networks. The chip thus has a total of 79 analog and digital inputs and outputs, 
plus power supply lines. It is packaged in an 84-pin package. It occupies an 
active Silicon area of 28 mm 2 and uses 28 500 transistors. 

Let us look at some of the particularly interesting features of the chip. 
Figure 9.17(a) shows the Circuit diagram of a pair of MDACs. The top converter 
produces an excitation current I E , or positive weights; the other produces an 
inhibition current / 7 , or negative weights. The E/I control bit determines which 
of the two converters is active. Weight voltages are encoded in 4-bit register 
Bl through BA. Weighted resistors are built of NMOS transistors biased in the 
resistive mode and controlled at their gates by the binary voltage levels. This 
should ensure the resistors ’ linearity for the limited range of input voltages as 
expressed by formula (9.21). Here, instead of adjusting the R ds of individual 
weighted resistances through design of a suitable ratio W IL of individual tran- 
sistors, all transistors have been made of identical sizes. The resulting weighted 
resistance value is determined by the configuration of transistors controlled by 
each bit. It can be seen that Bl Controls two transistors in series; thus their total 
resistance is 2 R ds . In contrast, BA Controls four transistors connected in parallel 
so their total resistance is 0.25 R ds . In series with the weight-producing transistors 
are switch transistors controlled by th t El I bit. This bit determines whether the 
excitation MDAC (top) or the inhibition MDAC (bottom) contributes the current 
to the output currents I E and / 7 , respectively. 

Figure 9.17(b) depicts the family of measured output current characteristics 
of a single MDAC as a function of input voltage v in . The values of the digital 
control word D serve as parameters for each curve shown. The lowest nonzero 
7(v in ) characteristics is produced for D = 1/16; the top one is for D = 15/16. 
To maintain the linearity of the weighted resistances of the MDAC, the converter 
input voltage v in was kept below 1 V. The output current nonlinearity was found 
to be within 5% for this input range. 

The described chip has been successfully tested in several applications (Raf- 
fel 1989). In the first experiment, a Hamming network has been designed using 
the approach outlined in Section 7.1. The network has been implemented for 16- 
dimensional input vectors representing six different classes (n = 16,p = 6). Each 
of the six neurons in the Hamming network must be provided with bias input; 
therefore, fixed input voltages must be applied in addition to the weighted input 
voltages. The Hamming network requires weights with a single-bit resolution 
only since its binary weights are of values ±1. Six distinet prototype vectors 
labeled A through F, one of each class, have been stored in the weights of the 
tested network. 
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Figure 9.17 MDACs pair for implementing positive or negative weights with 6.25% resolution: 
(a) Circuit diagram and (b) output current of one of the MDACs. [©Lincoln Laboratory Journal, 
reprinted from Raffel et al. (1989) with permission.] 
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Figure 9.18(a) tabulates the results of the recall experiment performed by the 
network. Applied input vectors numbered 1 through 6 have been identical to the 
stored prototypes A through F. Each of the six output nodes labeled A through F 
is associated by the neural network with a stored vector bearing the same label. 
As expected, the coiresponding output neuron’s response A through F is the 
strongest for this set of inputs. This is highlighted on the diagonal of the matrix 
shown containing the highest responses circled and having measured values of 
1130 or 1125 mV. The boxed numbers in the table of responses correspond to 
HD = 8, or a difference of 8 bits between an input and a stored prototype. For 
input vectors 7, 8, and 9 applied to the network, the detected prototypes at the 
smallest Hamming distance value are A, C, and A or B , respectively. 

To test the chip operating with its full 4-bit resolution of weights, the net- 
work was configured as a 22-component vector correlator-classifier (first layer), 
followed by the MAXNET network (second layer and output node). The network 
diagram is depicted in Figure 9.18(b). Twenty two analog input voltages drive 
seven neurons, each of them representing one class, equivalent to a single clus- 
ter. This layer consists of 154 MDACs, one for each weight, connected to seven 
summing current-to-voltage converters. The second layer is configured as seven 
amplifiers, which amplify the difference of the input and feedback voltages. In- 
puts to this layer of neurons are from the first layer neurons, and from the single 
summing output node called here the feedback node. This node provides the 
output sum which is fed back to the inverting inputs of the differential amplifiers 
that constitute the second layer. The second layer of seven neurons and the output 
summing node operate jointly as MAXNET. 

The discussed network configuration has been tested for classification of 
speech samples. The speech samples were the seven spoken numbers from “one” 
to “seven.” The inputs to the network represent two concatenated 11-tuple vectors, 
each containing the spectral components from two 10-ms frames of the spoken 
numbers’ samples. The network has been tested for 182 trials of different spoken 
digits. Seven misclassifications by the network were identified during the test. 

Another implementation of programmable weights using MDACs is reported 
by Moopenn et al. (1990). A chip with 1024 weights utilizes 7-bit data latches 
to store the control words for the 7-bit weighted-resistance MDACs. The pro- 
grammable weights implemented on the chip are of 6-bit resolution; the seventh 
bit is used for producing the weight polarity. To ensure the linearity of weights 
made of the drain-source resistances of NMOS transistors, long channel transis- 
tors are used in this design. Also, the high gate voltage of the switching transistor 
ensures operation within the linear range of drain-source characteristics of indi- 
vidual transistors serving as programmable resistances. The performance of the 
chip configured as a Hopfield network was successfully tested for generating the 
solution of the eight-city traveling salesman problem described in Section 5.6. 
The intercity distances were encoded in the programmable weight values. 
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Learning Weight Implementation 

In our discussion so far, microelectronic analog weight, which transmits the 
signals between neurons, is typically represented by a channel conductance of 
an MOS transistor. The transistor ’s gate voltage is stored across the gate capac- 
itance and performs the weight control function. Learning algorithms, however, 
have been considered so far to be running off-line. In this section we take a 
more challenging approach. Learning examined in this section is regarded as a 
continuous-time on-chip process, occurring while the network is responding to 
the present inputs. The learning signal is not switched on and off; thus, learning 
and recall are concurrent. However, the learning proceeds with its own dynamics, 
which is usually slower than the dynamics of the neural network recall. 

Figure 9.19(a) shows an integrated Circuit weight implementation that makes 
use of this concept. Weight value w# is determined by the weight control voltage 
v c , which is supplied by the learning algorithm. One of the most interesting 
learning modes is Hebbian learning for which the learning signal is simply 
available at the neuron’s output. This section covers the design of weights that 
leam according to the Hebbian rule discussed earlier in Section 2.5. The weight 
adjustments according to this unsupervised learning rule are A w t j — co^j as 
expressed in (2.32). 

The learning weight performs the computation of weight adjustment locally 
at the synapse. The key component for weight adjustment is the product of analog 
input and output voltages, o t Xj. This particular computation can be implemented 
by components called analog multipliers. A number of multiplier configurations is 
available for integrated Circuit implementation (Geiger, Allen, and Strader 1990; 
Allen and Holberg 1987). As we will see below, the analog product computation 
can be implemented by a simple two-transistor Circuit. 

Figure 9.19(b) shows the learning synapse using the simple Circuit of an 
analog multiplier consisting of a pair of identical transistors Af 3, and Af 4 (Card, 
Schneider, and Moore 1991). Transistor Afl performs functions identical to tran- 
sistor Afl on Figure 9.13. The synapse qualitatively approximates the multiplica- 
tive Hebbian correlation of input and output voltage levels, Xj and o r The learning 
current i L is approximately proportional to the product o t Xj when voltage v c = 0. 
This approximation for current is valid conditionally for an existing correlation 
between the variables Xj and It is supported by the folio wing discussion as 
well as by the measurements illustrated below. Assume that current i L starts 
flowing at t — t 0 into capacitor C due to the existing correlation between Xj and 
o The weight control voltage v c increases since 

v c (0 = I / i L (t)dt (9.30a) 

c Jto 

When continuous learning occurs and constant current i L {t) = I L is assumed. 
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Figure 9.19 Electrically tunable weight wf (a) without learning and (b) with learning embedded 
into the weight structure. [Part (b) adapted from Card, Schneider, and Moore (1991); © IEE, with 
permission.] 


voltage v c (t) increases linearly in time since 

v c ( 0 - ^ [v c (t 0 ) + tI L 1 (9.30b) 

Let us examine in more detail how the learning of the weight value is 
accomplished in this circuit. Assume that the levels of neuron outputs exceed 
the magnitude of the threshold voltage of the enhancement mode PMOS devices 
used, thus \ x j\ > V,h and \oi\ > \V th \. Three distinet learning conditions can be 
identified in this circuit as follows: 

1. Both Xj and o i are positive and therefore correlated. For Xj > 0 and o { > 0, 
assume that M3 is conducting and M4 is cut-off. For A/3, M4 we have, 
respectively 

^gs3 ~ Xj O i ^ 

Vg S 4 ~ Xj + Oi > Vfft 
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Both inequalities are fulfilled unconditionally; our assumption has thus 
been correct; therefore, Af3 is indeed conductive and MA remains cut-off. 
Current I { flows through Af 3, provided that the diode is forward biased. 
This, in tum, is only guaranteed as long as the voltage v c remains small 
enough, or o t large enough, to keep the diode conductive. 

Under certain conditions, M3 is found to be saturated and thus I x 
remains constant in time and independent of v c , which increases linearly in 
such circumstances. Since the condition for saturation of a PMOS device, 
V ds < V gs - V th , is equivalent to V d < V g - V th , we obtain the following 
condition for saturation of Af 3: 

v d < -Xj ~ V th 

or 

v c + V < -Xj - V th 

The above formula indicates that M3 indeed remains saturated below a 
certain level of v c . Otherwise, the condition is not fulfilled and M3 is 
operating in the resistive range. 

2. Both Xj and o, are negative and therefore correlated. For Xj < 0 and 
o, < 0, assume that M3 is cut-off and M4 saturated. For M3, MA we have, 
respectively 

Vga = ~ x j ~ °i > v th 
V & 4 = Xj + o i < V th 

Again, this proves that the assumption about the operating regions of 
Af3 and Af4 is correct; thus, Af3 is indeed cut-off and Af4 is conductive. 
Current / 2 flows through Af4 provided again that the diode is forward 
biased. Similarly as in case 1 the voltage v c can be shown to increase as 
a result of this process due to the existing correlation of Xj and o r Again, 
as in the previous case, / 2 remains constant in time for MA saturated. The 
reader can verify that Af4 is saturated when 

v c + V<Xj- V th 

Otherwise, Af4 operates in the resistive range if this condition is not met. 

5. For Xj < 0 and o i > 0 with an additional assumption of \xj\ = we 
obtain V gs3 = V gs4 = 0. Both transistors are thus cut-off. As required, no 
leaming takes place in this case and I L = 0, since x y and o { are obviously 
anticorrelated. 

Let us summarize the leaming progress in this Circuit. As the capacitor 
charges due to the computed correlation of o t and Xj, the weight w t j increases. 
The voltage at the node D is equal to v c plus the forward-biased diode voltage 
drop V. It can be noticed that for a certain level of v c + V , the voltage v c . ceases 
to increase and i L stops flowing even though the product o { Xj is positive. Thus, 
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the leaming signal represented by i L becomes affected by the acquired weight 
value Wy(0 and the weight leaming process slows down, or even vanishes. It 
can therefore be seen that this Circuit implements the Hebbian leaming with 
saturation, because high weight values slow down further leaming. Transistor 
Ml is used to control the leakage current, which now produces the intentionally 
embedded weight forgetting property of this Circuit. The decay of weights because 
of leakage current through Ml occurs much slower than the leaming so that the 
weight can memorize a large number of inputs / output pairs. The diode allows 
for decoupling of the leaming and forgetting process so that their rates may be 
independently adjusted. The forgetting rate is usually several orders of magnitude 
slower than the leaming rate. 

The dimension of each synapse from Figure 9.19(b) implemented in \2-fim 
CMOS technology is 115 X 10.9 fim . The smallest length of the CMOS transistor 
channel in this technology is 1.2 fim. A synapse chip with an area of 1 cm 2 
could contain approximately 75 000 individual leaming weights. SPICE simu- 
lations have been performed for the weight model discussed (Card, Schneider, 
and Moore 1991). Figure 9.20(a) depicts the current ^ through the weight as 
a function of the weight input voltage Xj with v c being the parameter. Figure 
9.20(b) shows the leaming current i L versus o i9 with Xj being a parameter for 
each of the four characteristics shown. The figure corresponds to the case v c = 0 
and therefore represents the maximum leaming rate. It can be seen that the pair 
M3 and MA performs well as an analog voltage multiplier, except for values o t 
and Xj near the origin. The graphs reveal that small correlations existing between 
o l and Xj are ignored by this multiplier. 

Figure 9.20(c) illustrates the weight control voltage curves, v c , versus current 
i L . The figure displays four different positive correlation leaming cases between 
Oi and Xj : (5 V, 5 V), (5 V, 3 V), (3 V, 5 V), (3 V, 3 V). Note that the fastest leaming 
is observed for (5 V, 5 V); however, faster leaming occurs for case (5 V, 3 V) than 
for case (3V,5V). The figure clearly illustrates the leaming with saturation by 
this weight circuit. It can also be seen, that the weight decay starts at t = 30 ns 
when the leaming ceases and a slow weight value decay begins due to the 
discharge of C through Ml. 


~ ACTIVE BUILDING BLOCKS 
OF NEURAL NETWORKS 

Although the number of weights usually outnumbers the number of comput- 
ing nodes, and the majority of the artificial neural network circuitry is composed 
of connections, an important part of the processing is performed outside synapses. 
Specifically, functions such as summing, subtracting, computing scalar products, 
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Figure 9.20 Simulation results for learning weight of Figure 9.19(b): (a) weight current versus 
input voltage, (b) learning current versus output voltage, and (c) learning rate and saturation 
phenomena (t < 30ns); followed by forgetting through weight decay (t > 30ns). [Adapted from 
Card, Schneider, and Moore (1991); © IEE, with permission.] 


or nonlinear mapping are for the most part performed independently of synaptic 
weighting. A number of electronic functional circuits such as current mirrors, 
inverters, differential amplifiers, transconductance amplifiers, analog multipliers, 
and scalar product circuits can perform these operations. Integrated circuits able 
to implement these functions and designed using CMOS technology are covered 
in this section. To avoid duplication and maintain our focus of study on neural 
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networks, the coverage of these building blocks is less comprehensive in this 
text than in more advanced microelectronic texts. Primary attention is devoted in 
this chapter to the circuits’ applications in artificial neural systems rather than in 
their typical switching or amplifying configurations. For a more comprehensive 
study of building block Circuit analysis and design, the reader may refer to a 
number of specialized microelectronic circuits texts available (Geiger, Allen, and 
Strader 1990; Allen and Holberg 1987; Millman and Grabel 1987; Sedra and 
Smith 1991). 


Current Mirrors 


A current mirror is one of the most essential building blocks of many analog 
integrated circuits, with artificial neural systems being no exception. A current 
mirror takes its input current and creates one or more copies of it at one or more 
of its outputs. The term current amplifier is also used for a current mirror since 
an input current change results in an output current change. 

Figure 9.2 l(a) shows an example of a simple current mirror consisting of 
two matching NMOS transistors, Ml and M2. The current mirror produces out- 
put current 7 0ut proportional to the input current 7 in so that 7 0Ut = a/ in , where 
a is the proportionality constant dependent on the transistor sizes and the fab- 
rication process constants. For practical purposes, the relationship 7 0ut = al in 
should be maintained in large range of input current variations. Note that since 
V gsl = V ds \ , then the condition V dsl > V gsl — V thX is always fulfilled and, there- 
fore, Ml is permanently saturated. Assuming that an additional condition is met, 
V M , both devices can be considered to be saturated. We 


y ds2 


= V ft 


> v a 


gs2 


thus obtain the mirror output /input current ratio (see Appendix): 

/ out _ k'(W 2 ILtfy g q ~ VW) 2 a + AV^z) 


(9.31a) 

An k\WJL y ){V gs t-V^m + W^) 

where A stands for the channel length modulation effect. In our previous dis- 
cussion of MOS transistor modeling, A has been assumed to have a zero value 
without making a significant error. For the current mirror study, however, the 
simplifying assumption A = 0 may lead to essential inaccuracies of the analysis. 
Since V gs2 - V gsl and V lhl - V lh2 for matching transistors, we obtain from (9.31a) 


J OUt _ 


w 2 /l~ 

WJL, 


/ 1 + A V*2 \ 

VI + A V ds J 


(9.31b) 


Ideally, / ou t/hn should only be a function of the aspect ratios (W/L) of both 
transistors. For identically sized transistors, and V dsl = V«2, input and output 
current are then exactly equal. In such an ideal case, the current gain value a is 
equal to unity. In the case of K/s2 ^ V dsl , however, the channel length modulation 
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Figure 9.21 Current mirror with NMOS transistors: (a) simple Circuit diagram and (b) example 
/out = f(/j n ) characteristics for unity-gain case [(W^IL^) = (Wfe/ia)]- 


factor A causes the output current to be slightly different than the input current. 
For A = 10~ 2 F -1 (typically), 7 0ut displays then a slight deviation from / in that 
can be computed from (9.31b) for given values of drain-to-source voltages for 
both Ml and M2. 

Figure 9.2 l(b) depicts the unity-gain current mirror characteristics. The 
current mirror operating area should be maintained in the saturation region of 
transistor M2 . Therefore, v out must satisfy the condition v out > V gs2 - V th2 , which 
easily translates into the following inequality 



where k 2 = k'W 2 /L 2 . 

The output characteristics of the current mirror are shown in the figure as 
horizontal lines for the idealized case when A = 0, thus indicating that the device 
reproduces the input current at its output in a wide range of output voltages v out . 
A current mirror therefore behaves as both current source and current replicator. 
For more realistic MOS transistor modeling with A =£ 0, the output current 
characteristics in the saturation region have nonzero slopes. This is shown in 
Figure 9.21(b) with dashed lines. It can be seen that the output current produced 
by the current mirror also depends to some extent on its output voltage. As a 
result, the device can therefore be considered a nonideal current source. 
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Figure 9.22 Current mirror use for computing the difference between the excitatory and 
inhibitory currents. 


Current mirrors are ideally suited for detecting the differences of two cur- 
rents. Specifically, they can be applied in artificial neural networks to compute 
the difference between an excitatory and an inhibitory current. This function can 
be performed by the unity-gain current mirror loaded with a current-to-voltage 
converter as shown in Figure 9.22. Noting that the current / to be converted is 
equal to the following current difference 

I = I + - / out (9.33a) 

and using the current mirror property of /_ = 7 0ut we obtain 

/ = /+-/_ (9.33b) 

Thus, the converter output voltage is of value 

v 0 = (/ + - 1_)R f (9.34) 

T his result is valid provided the operational amplifier remains in the linear region 
and does not saturate. Otherwise, output voltage v Q takes either the positive or 
negative saturation value, / sat+ or / sat _ , respectively. 

Three main reasons make the actual current mirror perform as a nonideal de- 
vice: the channel length modulation effect, threshold voltage differences between 
transistors used, and imperfect geometrical matching of the fabricated transis- 
tors. Figure 9.23 shows three circuit diagrams of improved current mirrors. They 
are cascode (a), Wilson (b), and modified Wilson (c) types (Millman and Grabel 
1987; Geiger, Allen, and Strader 1990; Sedra and Smith 1991). All of the current 
mirrors shown have distinctly more constant current / out versus v out , and reduced 
channel length modulation effect compared to the original simple two-transistor 
circuit from Figure 9.2 l(a). 
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Figure 9.23 Basic configurations of MOS current mirrors: (a) cascode, (b) Wilson, and 
(c) modified Wilson 


Inverter-based Neuron 

Most neuron elements presented in this chapter have consisted of a suitably 
coupled operational amplifier Circuit that simultaneously performed current sum- 
mation, current-to-voltage conversion, and nonlinear mapping. We have noticed, 
however, that the activation functions of neurons designed so far with operational 
amplifiers are of limited form only. As discussed, they are (1) linear, (2) of finite 
gain, with the gain value determined by the slope of the linear part of the char- 
acteristics near the origin, and (3) beyond the linear region the characteristics 
saturated at / sat + and / sat _ levels. In summary, all activation functions produced 
thus far have been either linear or piece-wise linear. 

In numerous applications, however, the neurons’ characteristics are required 
to have a sigmoidal shape of the activation function. “Sigmoidal” is used here 
in the sense of any smooth S-shaped curve, and it does not refer to any specific 
function. Such characteristics can be obtained as voltage transfer characteristics of 
a simple complementary Circuit known in electronics as a “push-pull amplifier.” 
This Circuit behaves as a voltage inverter and is often used in logic circuits. 

The CMOS inverter circuit is shown in Figure 9.24 and consists of com- 
plementary NMOS and PMOS transistors with drains tied together that also act 
as the output terminal. Gates are also connected to form the input, and sources 
are tied to two opposite supply voltages V DD and —V DD , where V DD > 0. 
The circuit performs as an inverting voltage amplifier with considerable voltage 
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Figure 9.24 Inverter-based half-neuron Circuit diagram. 


gain near the origin. To derive the activation function for this circuit, we assume 
that there is full electrical symmetry between the NMOS and PMOS devices, or 
k n = k p . Symmetry also requires that both transistors’ threshold voltages be of 
the same magnitude, which means V thn — ~V thp . 

To obtain the voltage transfer characteristics v out (v in ) for this circuit let us 
use the equation 

hsn + hs P = 0 (9.35) 

valid for the case of open-circuit inverter output. It can be seen that each of the 
transistors can be in one of the three regions: cut-off, resistive, or in saturation 
(see Appendix). Although the number of combinations of regions for two tran- 
sistors in the pair is nine, only five pairs of stable States can exist in the entire 
range of input and output voltages (Zurada, Yoo, and Bell 1989). 

To construct the output characteristics, we start with large negative input 
voltages v in such that v in + V DD < V thn . Since by inspection of the figure we have 
for the NMOS transistor 

Vgsn = v in + Vdd (9.36a) 

thus Ml remains as the cut-off. However, at the same time transistor M2 is fully 
on since its gate-to-source voltage is below v Indeed, we have for M2, 

V gv = v in " Vdd < V lhp (9.36b) 

However, since no current can flow through any of the two transistors con- 
nected in series due to the cut-off of Ml, transistor M2 behaves as a short circuit 
between the drain and source. Also, the following condition holds for the resistive 
region of M2: 


Vdsp > Vgsp + Vfhp 


(9.37) 
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We thus have for this region 


Vout = V DD , for V in <-V DD + V thn 


Figure 9.25(a) shows the voltage transfer characteristics of this Circuit. The 
discussed part of the characteristics is marked AB. In this operating range of the 
neuron transistor, Ml is the cut-off and M2 is in the resistive region. 

When the input voltage reaches the level slightly above VdD + V thn , the 
NMOS device enters the saturation region, the PMOS device remains in the 
resistive region and the channel currents start flowing with the following values: 

hsn = \{V gsn - V thn f (9.38a) 


I dsp k 


(Vgsp ^thpWdsp 


V 2 ' 

v dsp 


(9.38b) 


Let us now notice that (9.38a) and (9.38b) remain valid for the following con- 
ditions, respectively: 


v dsn > Vgsn ~ v thn saturated) (9.39a) 

and 


Vdsp > V gsp - v thp (M2 resistive) 

(9.39b) 

Using identities labeled on Figure 9.24 and expressing v in , and v out in terms 
of the transistors’ terminal voltages and supply voltages allows for the expression 
of all voltages between transistor terminals as follows 

Vdsn ~ ^out Vqq 

(9.40a) 

Vdsp ~ V out V[)i) 

(9.40b) 

Vgsn ~ ^in Vp)D 

(9.40c) 

Vgsp ^in Vp)D 

(9.40d) 


Conditions (9.39) for the transistors’ operating regions can now be rewritten to 
the form 


v out > v in _ Vthn saturated) (9.41a) 

v out > v in - V,hp (Af2 resistive) (9.41b) 

The line v out = v in — V thn obtained from (9.41a) separates the resistive and satu- 
ration regions of M2 as shown in Figure 9.25(a). A straight line v out = v in - V thp 
obtained from (9.41b) is also shown in the figure. The line passes through point 
C and separates the saturation and resistive regions of Ml. 

Our objective in this step is to find v out versus v in for Ml saturated and M2 
resistive, or beyond point B of the characteristics. This part of the input /output 
curve is responsible for the sigmoidal part of the neuron’s activation function. 
Using relationships (9.40) again allows for rearranging expressions for currents 
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Figure 9.25 Voltage transfer characteristics of an inverter-based neuron (device and power 
supply symmetry assumed): (a) v out versus v jn (A = 0) (ideal transistors case), (b) regions of 
transistors operation, and (c) v out versus v in , (A * 0) (nonideal transistors case). 
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as in (9.38) to the form 

hsn = + Vdd - V'hn ) 2 (9.42a) 

Idsp = -* (v in - V,h P - — \ — ) (Vom - Vdd) (9.42b) 

It now follows from (9.35) that (9.42) can be rewritten as: 

^(v, n + Vdd ~ V tf J = (v in - V thp - V ° ut y (v out - V DD ) (9.43) 

This relationship is somewhat unwieldy for hånd calculation but does provide the 
desired relationship between v in and v out in the concemed region of operation of 
the neuron. The relationship is graphed as part BC of the characteristics in Figure 
9.25(a). When v in reaches 0 V, the characteristics at point C crosses the region 
borderline and transistor M2 enters the saturation region. This results in two 
transistors being saturated simultaneously. The output voltage remains undefined 
in this CD part of the characteristics and — V thn < v out < — V thp for v in = 0. 

A summary of the operating regions for both transistors covering the com- 
plete characteristics is illustrated in Figure 9.25(b). The remaining part DEF of 
the characteristics can be obtained from a similar analysis using data in the two 
bottom rows of Figure 9.25(b). The discussion above makes use of the assump- 
tion that the channel length modulation effect is negligible in both transistors, or 
A = 0. For a more realistic assumption A^OWe obtain the inverter character- 
istics as shown in Figure 9.25(c). The main difference here is that in contrast to 
the ideal case, this curve is of finite slope in the region CD when both Ml and 
M2 are saturated. 

It is obvious that connecting two inverting half-neurons from Figure 9.24 in 
cascade yields a complete neuron element. The resulting input /output character- 
istics of the cascade connection is of the noninverting type and has a sigmoidal 
shape. This neuron configuration is especially well suited for building associative 
memory circuits or continuous-time gradient-type networks. Figure 9.26 depicts 
an example of a circuit diagram of an n - bit single-layer recurrent autoassociative 
memory using a cascade connection of two push-pull inverters as neurons. Due 
to the availability of both v, and inverted output v h positive as well as negative 
weights can be implemented in this circuit. The weight matrix of the memory 
network shown contains weights made of single-transistor resistances and con- 
trolled by the gate voltages. The transistors of this circuit need to operate in 
their linear region to maintain constant weight values as has been discussed in 
Section 9.2. 


Differential Voltage Amplifiers 

Throughout this chapter we have most often used integrated circuit opera- 
tional amplifiers as versatile building blocks of microelectronic neural systems. 
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An operational amplifier is essentially a differential voltage amplifier of very high 
gain with voltage output. As we have seen, the negative feedback loop of the 
operational amplifier has to be closed to achieve its stable operation. However, 
differential voltage amplifiers exist that can operate without the negative feed- 
back loop and their voltage gain is typically a hundred or more times lower than 
that of operational amplifiers. Such differential voltage amplifiers are very often 
used as input stages of operational amplifiers. This section is concemed with the 
properties of such differential voltage amplifiers. It also covers the basic concept 
of the transconductance amplifier. These two simple and closely related circuits 
have characteristics that make them desirable budding blocks of microelectronic 
neural networks. 
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Figure 9.27 Basic differential voltage amplifier with an NMOS pair: (a) Circuit diagram and 
(b) normalized transconductance characteristics for normalized differential input voitage. 


Figure 9.27(a) shows the schematic of an NMOS differential voltage ampli- 
fier called briefly the differential pair. The Circuit consists of a pair of matching 
NMOS transistors Ml and M2, two possibly symmetric loads, and a reference 
current source /„. Let us examine the pair’s transfer characteristics, noting that 
the circuit should primarily serve as an amplifier of the differential voltage signal 
v ind defined as follows: 

v ind å V gsl -V gs2 (9.44a) 

which is equal to 


^ ind ^ gi 


(9.44b) 


Assume that transistors Ml and M2 are always in saturation, which is a reason- 
able assumption as we will see later from subsequent discussion and from the 
lines separating operating regions on Figure 9.28(a). For the saturation condition, 
we have the drain-to-source current of both transistors: 


hi = j(V g s, - V M ) 2 , for i = 1 , 2 


which leads to the following expression for and 



(9.45b) 
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Figure 9.28 Voltage transfer characteristics of a differential input amplifier: (a) single-ended 
output configuration and (b) double-ended output (differential output) configuration. 


This allows us to express v ind from (9.44a) as a function of currents I dl and I d2 : 



(9.46) 


provided V M = v th2 , and k { = k 2 = k for matching transistors. Observing that 
the reference current source establishes the sum of both channel currents I dsl and 

Idsi as e 9 ual to I SS 

Iss = Idl + Idl 


( 9 . 47 ) 
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and forming a quadratic solution of (9.46) and (9.47) for hx and hi leads to 

1/2 

| (9.48a) 


i hs , hs 

ldX - T + 2 


' kv 2 k 2 4 > 

KV ind _ K lind 


4/?„ 


j _ *ss _ 

l di — 


' kv\ 


ind 


k 2 v 4 
Z V ind 

4P 


1/2 


(9.48b) 


Due to the assumed saturation condition, expressions (9.48) are valid for the 
following limited range of differential input voltages: 





(9.49) 


The transfer characteristics l d \ versus v ind and I d2 versus v jnd can now be sketched 
using (9.48) in the region of v ind specified by (9.49). Figure 9.27(b) shows a plot 
of the normalized drain currents through transistors Ml and M2. 

Assume that the loads of the driving transistors Ml and M2 from Figure 
9.27(a) are resistors of an arbitrary value R L . It can be seen that the output 
voltages v dx and v d2 can be expressed as 


v di - Vdd hx*L 
v d2 = VdD ~ hlh^L 


(9.50) 


Equations (9.50) are plotted in Figure 9.28(a). The graphs depict the voltage 
transfer characteristics of a single-ended output differential-input amplifier with 
outputs being either v d] or v d2 . The saturation regions for both transistors are also 
graphed on the figure. The equations of lines separating the transistors’ operating 
regions are 


^ind _ y 
v d\ ^ V th 


Vd2 = 


Yjnd 

' 2 


-v t > 


(9.51) 


and this indicates that the initial assumption about the two transistors being 
saturated is fulfilled in a rather large range of the differential input voltages. Let 
us also notice that the amplifier output can also be configured in the differential 
mode so that v out — v d2 — v dl . Such a connection ensures that the range of the 
output voltage covered by this configuration is between —1 ss Rl and l ss R^ This 
can be seen from the double-ended differential-input characteristics shown in 
Figure 9.28(b). 

Although the double-ended, or differential output, amplifier provides a desir- 
able shape for the transfer characteristics and its gain is twofold compared to the 
single-ended stage, none of its two output terminals are grounded. To circumvent 
this rather impractical load configuration, a current mirror can be used to form 
the load. The advantage of such a configuration is that the differential output 
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signal in the form of drain currents in the Circuit of Figure 9.27(a) is converted 
to a single-ended output with the load grounded at one terminal. 

The Circuit configuration implementing this concept is shown in Figure 
9.29(a). In this Circuit, the output voltage or current is taken from the connected 
drains of transistors M2 and MA. The most essential feature of this arrangement 
is that by definition of the current mirror as in (9.33) we have in this circuit 


/out = hs - hi (9.52) 

This circuit providing the output variable of current / out is also called the 
transconductance amplifier. It behaves like current source controlled by the dif- 
ferential input voltage v ind . 

Expressing currents I dl , I d2 as in (9.48) allows for rewriting (9.52) as fol- 
lows: 


1 OUt 


'kv 2 

_ind 


k 2 V 4 ' 
K * ind 

Afi 


1/2 


The transconductance of this circuit, g m , is defined as 

dd «) | 


Sm 


d ( v ind ) 


v incl=° 


(9.53) 


(9.54) 


Computation of g m from (9.54) using (9.53) indicates that the transconductance 
g m of the amplifier is identical to the transconductance of a single transistor Ml 
or M2 of the pair (Geiger, Allen, and Strader 1990) and is equal to 


Sm \/^/w 


(9.55) 


If the load of the transconductance amplifier is resistive, then the circuit 
performs like an ordinary differential voltage amplifier with the voltage gain of 
value 

= gJt, (9.56) 

d(v ind ) 8m L 

The symbol of the transconductance amplifier is shown in Figure 9.29(b). It 
can be seen in this figure that the amplifier load is resistive. The differential 
voltage gain expression of (9.56) is only valid near the origin v ind = v out = 0. 
Similar to the transfer characteristics from Figures 9.27(b) and 9.28(b), the char- 
acteristics of the transconductance amplifier with resistive load also becomes 
strongly nonlinear for larger differential inputs v ind and eventually it saturates. 
Typical curve v out versus v ind is shown in Figure 9.29(c). Since the currents of 
the differential amplifier in saturation are limited to ±I SS , the output voltage of 
the loaded transconductance amplifier is limited to between ~I SS R L and I SS R L . 

The transconductance amplifier has several distinet characteristics differ- 
ent from the conventional high open-loop gain differential operational amplifier. 
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Figure 9.29 Transconductance (voltage-to-current) amplifier: (a) Circuit diagram, (b) symbol of 
the transconductance amplifier ioaded by a resistance R L , and (c) voltage transfer characteristics 
for the Circuit of part (b). 
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They are: 

1. The voltage gain level of an operational amplifier is controlled by the ratios 
of resistance in the feedback loop to the resistances between the signal 
input and the operational amplifier input. In the transconductance amplifier, 
the gain is proportional to g m , which is controlled by the intemal supply 
current I ss , as can be seen from expressions (9.55) and (9.56). 

2. The voltage gain of the operational amplifier does not depend on the load 
resistance R L as is the case for the transconductance amplifier [see Equation 
(9.56)]. This difference is due to the faet that the operational amplifier 
performs as a voltage source, and it has output stages with low output 
resistance. The transconductance amplifier performs as a current source 
producing current of certain value; its output voltage is thus proportional 
to the load resistance value. 

5. The input /output voltage or current transfer characteristics of the 
transconductance amplifier characteristics have a sigmoidal shape. The 
transconductance amplifier computes an activation funetion with a smooth 
transition from the linear behavior to the behavior that is saturated. The 
counterpart characteristics of the operational amplifier are, rather, linear 
next to the origin and then displaying flat saturation far from the origin. 

The simplicity of the design and the sigmoidal shape of the produced transfer 
characteristics make the transconductance amplifier a versatile building block of 
neural networks as will be shown later. One additional desirable feature of the 
transconductance amplifier is its high input resistance. 


Scalar Product and Averaging Circuits with 
Transconductance Amplifiers 

Transconductance amplifiers can be used to process voltage input signals 
and are able to produce current or voltage outputs. The Circuit can be configured 
in a number of ways depending on its intended funetion. Figure 9.30(a) shows 
the voltage follower Circuit using the transconductance amplifier with a negative 
feedback loop closed. Its output current i Q is proportional to the difference be- 
tween the output voltage and the input voltage and is expressed by the following 
formula: 

io = 8m(v in - v o) (9.57) 

Substituting i Q = v 0 /R, the expression for the voltage gain of the voltage follower 
Circuit becomes 

Vq _ Sm R 
Vin 1 + 8m R 


(9.58a) 
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Figure 9.30 Useful configurations of transconductance amplifiers: (a) voltage follower and 
(b) follower-aggregation Circuit. 


If the follower is connected to a high-resistance load and g m R » 1, the output 
voltage approximately follows the input voltage since we have 

^ s l (9.58b) 

^in 

The most attractive property of the voltage follower as an electronic Circuit 
building block is that the output voltage follows the input. However, no current 
is taken at its input, thus the circuitry that provides the signal v in to the follower ’s 
input is not loaded due to the follower’ s very large input resistance value. Also, 
the follower ’s output voltage depends only on the input voltage and remains 
invariant under variable load conditions. 

Several voltage followers can be connected to compute a scalar product or 
the average value of several input voltages (Mead 1989). The Circuit diagram for 
such a computational function of interconnected transconductance amplifiers is 
depicted in Figure 9.30(b). The configuration is called the follower-aggregation 
Circuit. Each amplifier i has a transconductance g mi . Each voltage follower sup- 
plies an output current proportional to the difference between the input voltage 
and the output voltage. The contribution of each input v ini is weighted by the 
transconductance of the associated amplifier. 
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For the output node of voltage v 0 we can express the total current as a sum 
of contributions g mi (v ini — v 0 ), for i = 1, 2, ...» n. Since no current leaves the 
output node we have 


n 


S Smi^yini ® 

i= 1 


Rearranging the terms of (9.59) and computing v G yields 


(9.59) 


_ Smi^ini 
1U g mi 


(9.60) 


Equation (9.60) shows that the output voltage v 0 is the weighted average of inputs 
v im , and that each input voltage is weighted by its transconductance. Thus, the 
follower-aggregation circuit shown in the figure computes the weighted sum of n 
input voltages, with transconductances g mi serving as the weighting coefficients. 
This function can also be interpreted as computing the average value of an en- 
semble of inputs. In another interpretation, the output voltage v 0 as in (9.60) can 
be understood to be a scalar product of weights and inputs, with the expression 
(X/L i gmi) serving as the proportionality constant. 

Formula (9.60) is only valid for the range of input voltages v in/ that en- 
sures the linear operation of each transconductance amplifier. If one of the 
input voltages v ini is of excessive positive or negative value, the correspond- 
ing transconductance amplifier saturates and does not contribute beyond its bias 
current. This feature limits the effect of a possible faulty or out of bounds value in 
the averaging of a large number of signals. Thus, the voltage v 0 will not follow a 
few off-scale inputs. The discussed implementation of the averaging circuit using 
voltage followers has great robustness against bad data points (Mead 1989). 


Current Comparator 

A current comparator circuit offers another useful alternative solution to the 
task of summation of inhibitory and excitatory currents, which are produced in 
synaptic connections at the neuron ’s input. The comparator sums the inhibitory 
and excitatory currents flowing through the weights and performs the nonlinear 
mapping of their difference into the output voltage. The comparator ’s output 
voltage may be required to have sigmoidal characteristics as a function of a 
differential current 7 + — Figure 9.31 shows a typical circuit diagram for 
a CMOS current comparator. It uses two p-channel current mirrors consisting 
of transistors Ml, M2 and M3, M4, one n-channel current mirror consisting of 
transistors M5, M6, and one CMOS inverter as a thresholding element (Bibyk 
and Ismail 1990). The inverter, composed of M7 and M8, should preferably be 
electrically symmetric. 


M4 


Excitatory 

currents 


(a) 



I + /- 

(b) 


Figure 9.31 Current comparator: (a) Circuit diagram and (b) voltage versus inhibitory current 
characteristics. 
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The sum of all excitatory currents, /+, is mirrored through the w-channel 
current mirror, and it flows through M6. At the same time the sum of all inhibitory 
currents flows out from transistor M 2. If /+ = /_ , then M2 and M6 will source 
and sink, respectively, the same amount of current. The output voltage in such a 
case is equal to V DD /2. Any imbalance of currents such that /+ < /_ causes the 
output voltage v Q to decrease from its quiescent, center value of V DD / 2. This is 
shown in Figure 9.3 l(b). When I + > /_, a reverse situation occurs and the output 
voltage increases. For proper operation of this Circuit, the sizing of transistors 
must be such that the current mirror transistor in each of the three mirrors must 
remain in saturation during the entire operating range of total excitatory and in- 
hibitory currents. Each of the currents consists of n synaptic component currents 
produced by the input component signal weight. In addition, the current mirror 
design has to provide for accurate balance such that v 0 = V DD /2 at /_ = /+. This 
is ensured when M2, M6 and Ml, M 8 are electrically identical transistor pairs. 


Template Matching NetWork 

As discussed in previous chapters, template matching represents one of the 
frequently used forms of neural processing. Template matching can be of partic- 
ular importance for handling images in bit-map form, and is especially suitable 
for preprocessing of the binary data. Template matching networks are typically 
required to decode binary rc-tuple vectors and output the logic level that is unique 
and different from any outputs that are produced for rø-tuple inputs that do not 
match the specified template. From this standpoint, an output of the template 
matching network can be expressed as a combinational Boolean function. Per- 
haps the most suitable Boolean function for this task is called an equivalency 
function. It is defined as a complement of an XOR function. A number of tech- 
niques exist to design Boolean functions of interest. Although the operation of 
matching using these well-known techniques can be easily implemented using 
conventional digital logic gates, we will focus in this section on a design that 
is analog in nature and possibly involves only a single transistor per bit to be 
matched (Chung 1991). This ensures the minimum size of the matching network, 
with only a single transistor per weight, and it provides for analog operation and 
binary decision output. 

An example bit-map template to be matched is shown in Figure 9.32(a). 
The template shown is represented by the vector x = [ 1 0 X 1 ] \ where X 

stands for a “do not care” condition. The simple CMOS network of Figure 
9.32(b) consists of up to four transistors MO-MB, one per bit, plus a bias tran- 
sistor MB , and it can realize the template bit-map matching function of interest. 
The transistors are sized in such a way that only the matching bit input pattem 
results in v in > V DD / 2. This requires symmetrical transfer characteristics for the 
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Figure 9.32 Template matching network: (a) example template bit map and (b) Circuit diagram. 
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two output inverters, which produce the voltage v a as a function of v in such that 
v 0 — V dd /2 when v i = V DD !2. 

When the input matches the value of 1 stored in the template, this tums on 
the single PMOS bit transistor. This occurs since the gate voltage of the PMOS 
device is pulled down to zero; see transistors MO and M2 as examples. The 
channel resistance of value R then connects the power supply to the summing 
line, as would be the case when x 0 = x 3 = 1 for the example network. For a 
matching bit of zero value, the corresponding bit transistor of the NMOS type 
remains open. This will be the case for Ml responding to x x = 0. 

Assume now that the bias transistor MB , which is always conductive, also 
has a channel resistance value of R. Note that when MO and M2 are tumed on 
in parallel, they represent the total resistance of 0.5 R from the summing line to 
the Vod node. Since the total resistance connected between the summing line 
and the ground is of value R , this yields a value of % for the voltage division 
ratio at the summing line. As a result, v in = (2/3)1^, and the output voltage 
is high, thus indicating that a matching input vector is present at the input bit 
lines. The reader can easily verify that for a mismatch of one or more bits, the 
summing line voltage drops below the value V DD / 2 and the output voltage v 0 
can be interpreted as assuming a logic low value. 

To design the more general bit template matching network exemplified in 
Figure 9.32, all transistors must be electrically identical. This requires their de- 
vice transconductances to be identical so as to yield the resistances of value R 
when the devices are conductive. In addition, the bias transistor MB needs to be 
sized properly so that the matching input vector yields the value of v in above 
0-5 V DD . More detailed inspection of the network leads to the observation that the 
bias transistor resistance should be chosen equal to R/(n — 1), where R is the 
resistance of each of the bit transistors and n is the number of bits equal to 1 in 
the template to be matched (see Problem 9.20). 


ANALOG MULTIPLIERS AND 
SCALAR PRODUCT CIRCUITS 

Electrically tunable synapses make it possible to multiply the input signal 
x t times the associated adjustable weight value. In a case for which the weight 
value w f is a function of the control voltage, the scalar product of input and 
weight vector can be expressed in the following form: 

^ n 

net = X w i( v d) x i (9.61a) 

/=i 

where the fth weight value is dependent on the controlling voltage v d , for i = 1, 
2, . . . , n. This approach to computing the product of weight and the input signal 
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based on (9.61a) has been used in most discussions of the analog implementations 
of neural circuitry with tunable weights presented thus far in this chapter. 

Altematively, the scalar product of an electrically tunable weight vector and 
of the rt-tuple input signal vector can be obtained as a sum of outputs of 
analog voltage multipliers as follows: 

net = c X (9.61b) 

i=i 

where v z is the voltage directly proportional to the weight value w h and c is 
the proportionality constant common for all n weights under consideration. In 
contrast to the approach taken in (9.61a), now the weight value can be represented 
directly using the variable voltage v,. 

Essentially, the approach based on (9.61b) tends to be slightly more costly 
to use. Expression (9.61a) allows for implementation of the product of an input 
variable and of a weight value dependent on a weight control voltage considered 
to be a parameter. In (9.61b), weights are treated as direct voltage variables rather 
than as constants controlled by electrical variables. Thus, the products of input 
signals and weights are obtained through simple analog multiplication. 

Below we look at several implementation techniques of analog multipliers. 
We will also cover scalar product circuits that make use of analog multipliers. 
Such circuits often prove useful for building artificial neural systems. 


Depletion MOSFET Circuit 


A simple integrated Circuit analog multiplier can be built from all MOS 
components (Khachab and Ismail 1989a). The Circuit requires an operational 
amplifier and fully matched depletion mode NMOS transistors working in pairs. 
The multiplier can operate with both positive and negative input voltages. The 
basic multiplier Circuit diagram is shown in Figure 9.33(a). The input voltage is 
x, and the gate voltage v of transistor M2 is used to express the weight value. 

Since both transistors need to remain in the resistive region in this circuit 
arrangement, their drain currents for y v < Kb — V dep are expressed as 


1 ds ^ 


(V gs - V dep )v^ - ^ 


(9.62) 


where V dep is the depletion voltage of each transistor, typically having a value 
between —2 and —4 V. Since for M 1 we have V ds = V gs = x, expression (9.62) 
can be rewritten for this device: 


^dsl k 


(x ~ V dep )x - 


£ 

2 


(9.63a) 


Note that for the transistor M2 we have V ds — x and V gs = x + v; thus, expression 
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(b) (c) 


Figure 9.33 Analog multiplier with two depletion mode MOS transistors and an operational 
amplifier: (a) Circuit diagram, (b) integrated Circuit implementation of resistance, and (c) another 
version of part (b). 

(9.62) for the drain current becomes 

hsi = k (x + v- V dep )x - X — (9.63b) 

Figure 9.33(a) shows that the difference between currents I dsl and I ds2 is con- 
verted to the output voltage v 0 by the closed-loop operational amplifier configured 
as a current-to-voltage converter. We thus have 

= ~( I dsi ~hs2)^F (9.64a) 

Plugging expressions (9.63) into (9.64a) results in the following output voltage 
for the analyzed analog multiplier when it is working in the linear region of the 
operational amplifier: 

v Q = R F kvx 


(9.64b) 
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The nonlinear (quadratic) terms of I dsl and J ds2 have eanceled due to the 
subtraction of both currents. Expression (9.64b) therefore remains valid provided 
both depletion mode transistors operate in the resistive region. Thus, the following 
condition must be fulfilled: 


v + jc < 


Va 


dep 


Formula (9.64b) implies that the Circuit of Figure 9.33(a) is able to perform four- 
quadrant multiplication. The multiplication constant of the obtained multiplier is 
proportional to the transconductance value k of the transistors used, and to the 
user-selected feedback resistance R F . Thus, the multiplication constant can be 
suitably adjusted by the choice of R F . 

An all-MOS realization of this multiplier can be accomplished by replacing 
the feedback resistor with two depletion mode transistors operating again in 
the resistive region. Two possible replacement circuit diagrams for resistors are 
shown in Figures 9.33(b) and 9.33(c). Drain currents for transistors Ml, and 
M2 can be obtained for the matching transistor pair Ml, M2 of Figure 9.33(b), 
respectively, as 



(9.65a) 


Idsl — k 


(V g - V dep )V - 


2 


(9.65b) 


Expression (9.65b) is in a standard form (see Appendix) with the condition 
v* = v. Expression (9.65a) is obtained from the same standard form with 
additional simplification such that V ds = V = V gs . Since the value of the equiv- 
alent resistance implemented by the two transistors conneeted in parallel can be 
expressed as follows 


R = 


V 

Idsl + Idsl 


(9.66a) 


substituting I dsl and I ds2 from (9.65) into (9.66a) results in the following formula: 


k(V g - 2F dep ) 


(9.66b) 


It can be seen that V g is the control voltage used to tune the value of the equivalent 
feedback resistor. The resistance value also depends on the depletion voltage % 
of the transistors used. Figure 9.34(a) shows the voltage /current characteristics 
of the all-MOS resistor of Figure 9.33(b). The curves shown were obtained from 
the circuit simulation using the SPICE2G.1 program for V g = 2 V, k = 20 
/iA/ V 2 , and for four values of V dep as tabulated in Figure 9.34(b) (Rao 1988). 

As stated before, the discussed circuit configuration functions properly only 
for the resistive mode of operation of both Ml and M2 and thus the following 
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Figure 9.34 Characteristics of the all-depletion mode MOS resistor of Figure 9.33(b) (V g = 2 V) 
for Iføp as a parameter: (a) voltage versus current characteristics and (b) equivalent resistance 
values. 



In contrast to the configuration discussed, the resistance produced by the config- 
uration illustrated in Figure 9.33(c) does not depend on the value of the depletion 
voltage V dep of the transistors used. Calculations similar to those performed for 
the resistor of Figure 9.33(b) yield the following equivalent resistance value 
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Figure 9.35 AII-MOS scalar product Circuit with saturation. 


for this Circuit: 




(9.68) 


However, to achieve cancellation of both nonlinear terms in the current 
expressions (9.65) and of V^ ep , the input voltages to the multiplier of opposite 
polarities must be produced and used as inputs for this configuration. 

The discussed analog multiplier cell from Figure 9.33(a) can be replieated n 
times to implement the full scalar products as expressed in (9.61b). Weight coef- 
ficients can be either positive or negative — depending on the polarity of voltage v 
used to control the sign and weight value. Figure 9.35 shows the circuit diagram 
of the scalar product circuit. It computes the following expression, which is a 
linear combination of the terms of Equation (9.64b): 


v„ = R F (k x v 1 x x + k 2 v 2 x 2 + . . . + k n v n x n ) (9.69) 

where k t is the transconductance coefficient of transistors in the fth input pair, 
for i = 1, 2 , . . . , n. 

Expression (9.69) becomes similar to (9.61b) for all transistors of identical 
electrical parameters when k x = k 2 — . . . = k n . The saturation level, as well as 
the steepness of the activation function, are set by the voltage V g9 which Controls 
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the feedback resistance R F according to either (9.66b) or (9.68), depending on 
the resistor configuration used. Both voltage values x { and ~x { are required to 
drive the inputs of a single synapse of this scalar product Circuit. These double- 
polarity inputs are intrinsically available when neurons are comprised of cascaded 
inverting amplifiers, or push-pull inverters, as described earlier in this chapter 
(Bibyk and Ismail 1989). 


Enhancement Mode MOS Circuit 


The basic concept of analog multipliers using all-MOS integrated circuitry 
presented in the last section can be extended to circuits with enhancement mode 
field-effect devices or other types of active devices. Again, the common-mode 
nonlinear terms of a transistor ’s nonlinear current characteristics can be elimi- 
nated, and the difference current signal can develop the product of two input 
voltages at the output of the analog multiplier. The circuit diagram of the analog 
multiplier cell is depicted in Figure 9.36 (Khachab and Ismail 1989b). The circuit 
comprises a single operational amplifier and four electrically identical, matched 
MOS input transistors that could be enhancement or depletion, and both either 
n- or p-type. 

Using the model of the transistors in the resistive region and with refer- 
ence to the notation of Figure 9.36, the current differences at the inputs of the 
operational amplifier can be expressed as follows (Khachab and Ismail 1989b): 


-h = 


/; 


k' 


'W\ 

— ) (*! -* 2 )(y, -y 2 ) 

(9.70a) 
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(9.70b) 


where ( W/L), and ( W/L)f are width/length ratios of the transistors’ channels 
in the input and feedback paths, respectively. Applying the KCL at each input 
node of the operational amplifier we obtain 

I x -I 2 =r 2 - /', (9.71) 

Hence, the output voltage v a becomes from (9.70) 

(w/Di (*i - x 2 )(y\ 


y 2 ) 


(9.72) 
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To ensure the operation of all transistors in the resistive region, the following 
conditions must apply at the input and output, respectively, of the multiplier 

min(y, - V th ,y 2 - V th ) 

V th ) 


x x ,x 2 


v c < min(z! - V th ,z 2 


(9.73) 


where V th is the threshold voltage of the matched MOS devices. As can be seen 
from (9.72), the Circuit achieves the product computation of difference input 
voltages x x — x 2 and y { — y 2 . In addition, the necessary biasing levels must be 
provided at the feedback transistors’ gates dependent on the type of transistors 
used. Enhancement mode w-type transistors require positive gate voltages exceed- 
ing V th if they are to operate properly. Depletion transistors, however, can accept 
positive, zero, or negative bias voltages at their gates. 

Let us require the four-quadrant multiplication to be performed by the mul- 
tiplier cell with weight control voltage v, and for the input voltage x applied at 
gate and input terminals, respectively. Letting 

X| = —x 2 = x 

yi =Qo + v, y 2 = Q 0 -v (9.74) 

Z 1 = Vgl’ z 2 = ^2 

we obtain expression (9.72) for the multiplier’s output voltage, which reduces to 
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the desired form containing a product of analog voltages x and v: 


4 (W/L)j 

(W/L) f (V g i - K/ 


(9.75) 


Note that a scalar product of two n-tuple vectors can be achieved by a straight- 
forward extension of the simple two-input multiplier just described. In such a 
case, a total of n four-transistor circuits (as in Figure 9.36) is needed to combine 
into products all n inputs and n weight control voltages applied at gates. Weights 
are set equal to the gate control voltages v h for i = 1, 2, . . . , n 9 and inputs are 
Xp for i = 1, 2, . . . , n. 


Analog Multiplier with Weight Storage 


An alternative analog multiplier Circuit useful for neural network applica- 
tions can be designed using a matched pair of enhancement type NMOS and 
PMOS transistors connected in a differential mode (Kub et al. 1990). An exam- 
ple of a circuit that uses two PMOS devices is shown in Figure 9.37. The weight 
values are represented by the analog voltage v at the output of the switch M3, 
which serves as an access transistor. The short-term weight storage capacitor C 
is connected to the gate of transistor Ml. The capacitor is charged to the weight 
control voltage v for normal weight operation during recall. Capacitor C can 
be either an intrinsic gate capacitance of transistor Ml or a specially formed 
capacitor structure. 

For transistors Ml and M2, both in resistive regions, the drain-to-source 
currents can be expressed as 


hs\ “ k 


(v - V th )x - Z 


(9.76a) 


Idsi ~ % 


(vj? - V th )x - 


(9.76b) 


where v A , is the weight reference voltage applied at the gate of transistor M2 
and assumed here to be constant. It is also assumed that V ds > V gs — V th since 
both Ml and M2 must remain in the resistive region to ensure an appropriate 
operation of this multiplier. The difference current can be obtained using (9.76) 
as 


Idsi ~ l dsi = k(y ~ v r)x (9-77) 

Inspection of Figure 9.37 reveals that the output voltages v ol and v o2 are equal, 
respectively, 


v o\ ~ hsl^F 
V o2 — ^dsl^F 


(9.78) 



To other 
multiplier cells 
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provided that both current-to-voltage converters work in the linear part of their 
operational amplifier characteristics. The output operational amplifier is connected 
as a difference amplifier so as to yield the amplified difference voltage of value 

= (v o2 - v„i)^ (9.79a) 

In the linear range of operation of the output operational amplifier its output 
voltage can be expressed from (9.77-9.78) as given: 

V a = k~Rp{v R - v)x (9.79b) 

K i 

Note from (9.79b) that the resistances R u R 2 , and R F can be used to Control 
the steepness of the activation function in the linear region. As explained in the 
previous section, these values also set the range of the analog product value and 
its linear or nonlinear mapping to the output v 0 . Obviously, if n multipliers are 
connected to each of the current summing lines, we obtain the total of l ds currents 
of each line as /z-tuple sums. In such cases, the output voltage v Q expressed for 
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a single weight by expression (9.79) yields the scalar product as follows: 

v 0 = k^R F i(v Ri -v i )x i (9.80) 

K\ i=l 

where v Ri is the reference voltage of the i’th weight. 

A 11 X 11 matrix-vector multiplier based on the discussed concept has been 
implemented in CMOS p-well technology (Kub et al. 1990). Measurements of 
analog capacitive storage characteristics have shown that the weight decay rate 
due to the lack of weight refresh has been 30 mV / s at room temperature. The 
intrinsic gate capacitances with an estimated value of 90 fF have been used as 
capacitive storage elements. The weight values have been permanently stored in 
electrically erasable programmable read-only memory (EEPROM) or RAM in 
digital form. The analog weight voltage values have been provided at the inputs 
to the access transistors as output signals of digital-to-analog converters. 

Substantial reduction of the weight control voltage decay rate has been 
achieved in the Circuit of Figure 9.37 by adding a second capacitance C R at the 
gate of transistor M2. Since the leakage currents of the gate capacitances of Ml 
and M2 are similar, they tend to cancel each other when the multiplier output 
currents are subtracted. In the measured integrated Circuit with both C and C R 
used, the decay rate has dropped to 0.6 mV/s. This configuration, however, has 
required the addition of another access transistor M4 to store the weight reference 
voltage v R (transistor M4 not shown in the figure). The multiplier cell from Figure 
9.36 with three transistors fabricated using the 3-jim p-well CMOS technology 
was of 56 fim 2 . The entire chip consisted of 11 X 11 matrix-vector multiplier 
circuitry. It has been configured to compute 11 scalar products simultaneously, 
each with 11-tuple vectors. 


Floating-Gate Transistor Multipliers 

Floating-gate transistors offer rather simple and efficient Circuit implementa- 
tion techniques for artificial neural systems. In particular, the technique is suitable 
for designing large networks fabricated with modem VLSI technologies (Bibyk 
and Ismail 1990). The most attractive feature of floating-gate MOS transistors 
used as analog multipliers for neural networks is that both the weight storage and 
analog multiplication of weights and inputs are implemented concurrently and 
by the same circuitry. The floating-gate transistors provide an adjustable, non- 
volatile analog memory, and, simultaneously, they behave as analog processors 
when appropriately connected. 

The cross-section of a floating-gate MOS transistor is shown in Figure 
9.38(a). The device is similar to the MOS transistor, but in addition to the 
regular gate shown as a top layer, it has one more layer of polysilicon to store a 
trapped charge. Once the charge is trapped in this layer called the floating gate. 
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(a) (b) 

Figure 9.38 Floating-gate MOS transistor (n-type): (a) device cross-section and (b) symbol. 


it produces a shift in the threshold voltage of the transistor. Figure 9.38(b) shows 
the symbol of the device. Note that the floating gate is not connected to any 
other part of the circuitry. 

Let us briefly examine the operation of the floating-gate transistor. Before 
the device is programmed, it behaves as a regular n-channel enhancement mode 
MOSFET. In this case its threshold voltage is rather low. To program the floating- 
gate transistor, a large pulse voltage of about 20 V is applied between the gate 
and source terminals. Simultaneously, a large 25 V pulse of voltage is applied 
between the drain and source terminals. Due to the physical phenomena described 
in more detail elsewhere in the literature (Sedra and Smith 1991), the floating 
gate acquires charge. In this State, called the programmed State, the device thresh- 
old voltage is substantially higher than in the nonprogrammed State. Moreover, 
the value of the programmed threshold voltage can be controlled through the 
duration and level of voltage pulses applied. Once programmed, the device is 
able to retain the threshold value for years. 

To return the floating-gate MOS transistor to the non-programmed State, the 
erasure process needs to take place. The erasure corresponds to the removal of 
the charge from the floating gate. It is accomplished by illuminating the floating 
gate with ultraviolet light of a suitable wavelength. 

Let us now focus on the application of floating-gate devices for analog 
multipliers with simultaneous weight storage. The multiplier consists of a single- 
transistor current sink and two threshold-programmable transistors with gates tied 
together and connected as a differential pair as shown in Figure 9.39(a). Because 
transistors Ml and M2 are a perfectly matched pair, and thus nonprogrammed, 
h\ and I d2 remain exactly the same. Through device programming, the threshold 
voltages are made unequal. The resulting difference of drain currents hx and hi 
becomes proportional to the difference of the programmed threshold voltages. 

Assume that the current differences are sensed at the input of the neuron 
that produces the scalar product function and its sigmoidal mapping as discussed 
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Figure 9.39 Floating-gate weight storage /multiplier configurations: (a) Circuit diagram and 
(b) modified Circuit. 


earlier in this chapter. When input x is low, the circuit makes no current contribu- 
tion to the neuron since Ml and M2 are cut-off and l dl = I d2 = 0. For x larger 
and positive, the difference current / diff becomes proportional to the threshold 
voltage difference of the two transistors. 

An analysis of this circuit (Borgstrom, Ismail, and Bibyk 1990) shows that 
the differential current can be expressed: 


^diff “ I 


d\ 


l d2 


(9.81a) 


^diff ~ V* 




thd 



(9.81b) 


where the threshold voltage difference is %hd = V th2 — V thi , and k is the device 
transconductance common for both Ml and Ml. It is seen from (9.81b) that the 
current / diff is proportional to V thd , and also to the input voltage x in a narrow 
range of values. Although the latter relationship is not explicit in (9.81b), note 
that x establishes the level of the total current I ss . To improve the performance 
of the circuit, simple modification of swapping x and V hias as shown in Fig- 
ure 9.39(b) can be undertaken. The current / diff for this modified circuit is of 
approximate value 


'diff 


= XV, 


thd 


I*. * 

2 K mp K 


(9.82) 


where k inp is the device transconductance of the input transistor M3. Thus, the 
differential current 7 diff in (9.82) is proportional to the product of both the input 
voltage x and of V thd . The value of V thd , however, is made through program- 
ming proportional to the weight value. Figure 9.40 shows the differential current 
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(a) 


(b) 


Figure 9.40 Differential current characteristics for the modified Circuit of Figure 9.39(b): 

(a) versus the input voltage and (b) versus the difference of threshold voltages. [© IEE; adapted 
from Borgstrom, Ismail, and Bibyk (1990) with permission.] 


responses of the modified multiplier Circuit from Figure 9.39(b). The graphs have 
been obtained through Circuit simulations. They illustrate excellent linearity of 
Aiiff with respect to V thd , and good linearity with respect to x, in particular for 
large values of inputs. This can be seen from comparison of curves from Figures 
9.40(a) and (b). 

Floating-gate multipliers from Figure 9.39(a) can be used with a current 
comparator, example of which are shown in Figure 9.3 l(a). The inputs to the 
synapses, x, are single-ended voltages. The effective “weight” represented by the 
synapse is proportional to the difference of the two currents hi and hi' which is, 
in tum, proportional to the preprogrammed difference in the threshold voltages 
V th2 and v m .a single neuron with floating-gate programmable synapses can be 
obtained by connecting inputs hi of a number of synapses from Figure 9.39 
to the input / + of the current comparator of Figure 9.3 l(a), and by connecting 
inputs I d2 of the synapses to the input /_ of the current comparator. 

An experimental prototype of the neural network described above has been 
designed, fabricated, and tested (Borgstrom, Ismail, and Bibyk 1990). The chip 
was implemented with a 2-ju.m CMOS double-poly p-well process. Poly 1 layer 
has been used to make the floating gate, and the poly 2 layer as the top gate. 
The transistors’ aspect ratio used in the multiplier has been 20 /xm/20 /zm, 
and large W and L values have been selected to minimize mismatch due to 
the geometrical inaccuracies. The designed neural network uses the described 
current-mode technique and is especially attractive for modem VLSI technology. 
It offers programmable weights, excitatory and inhibitory current summation, the 
sigmoidal nonlinearity function of the current comparator, and the output in the 
form of voltage. The current-mode processing and current-mode budding blocks 
used by the network are of special importance for microelectronic technologies 
where voltage signal handling is often limited for analog applications. 
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— ASSOCIATIVE MEMORY 
IMPLEMENTATIONS 

This section briefly describes how to integrate electronic circuit components 
into an associative memory circuit. Since only modest precision is required for 
the nonlinear processors (neurons) and coupling elements (weights), associative 
memories are particularly convenient and attractive for microelectronic imple- 
mentation. In a typical memory circuit, input and output data are in binary form. 
However, as explained in Chapter 6, analog computation is performed by the 
network throughout the recall phase. This feature makes it easy to integrate an 
associative memory circuit as a specialized module within large digital systems. 
Although feedforward networks can also be termed, at least in a certain respect, 
as associative memories, we have reserved the term “memories” for networks that 
are typically trained in the batch mode, and are dynamical systems with feedback, 
which is activated during the recall phase. Only such dynamical memory systems 
are discussed here. 

Figure 9.41 shows a general schematic representation of a single-layer as- 
sociative memory. The programming of the connection matrix is performed by a 
weight control network, which is responsible for storing appropriate weights at 
suitable locations within the connection matrix. The connection matrix contains 


Weight 
control input 



Figure 9.41 Schematic diagram of a single-layer autoassociative memory network. 
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binary or, more accurately, temary weights of values 0, +1, and -1; sometimes 
weights need to be set with higher resolutions. The weight Control network 
can be designed either as a microprocessor-based controller or as a dedicated 
memory control network. This type of weight control network usually involves 
either a digital memory of RAM type, or registers, as well as address decoders, 
transmission gates, and buffers. 

One of the first associative memory circuits involves implementation of 
binary weights of values 0, +1, and -1 (Graf and Vegvar 1987). The weight 
circuitry has been designed as shown in Figure 9.42. The figure depicts a sin- 
gle weight configuration and an associated weight control circuitry. The Circuit 
from Figure 9.42(a) uses two RAM cells to program the single weight value. If 
DataOutl = 0 and DataOut2 = 1, both transistors M2 and MA are cut-off, thus 
allowing no current to or from the input line of the fth neuron. This programs 
a missing connection, or w t j = 0. 

Assume now that v y = V DD . An excitatory current flowing toward the fth 
neuron can be produced by programming DataOutl = DataOut2 = 0. The ex- 
citatory current / exc , which is sourced to the line, can be denoted as shown on 
the diagram of Figure 9.42(b). Tied transistors M3 and MA can be represented 
as their respective channel resistances R ds3 and R ds4 . This set of control signals 
from RAM cells implements = 1. 

When the memory outputs DataOutl = DataOut2 = 1, while v y remains 
positive, the direction of current reverses as shown in Figure 9.42(c). In this way 
Wfj = — 1 is produced due to the sinking of current away from the current input 
line of the fth neuron. This Circuit arrangement imitates the apparent negative 
conductance g t j = w t j of unity value between the output of the fth neuron 
and the input of the fth neuron. This negative conductance effect is due to the 
counterflow of sinking current z inh with respect to the sourced current z exc . 

The circuitry of Figure 9.42 implements fixed and temary weight values. It 
is obvious that by selecting weighted resistances of MOSFET devices, the amount 
of current sourced, / exc , or current sinked, / inh , could be controlled with better 
resolution. Several Circuit techniques to achieve a higher resolution of resistances 
along with the principle of analog or digital tunability of weights that could be 
used for associative memory circuits were discussed earlier in this chapter. 

An example auto-associative memory chip was fabricated with 2.5 /un 
CMOS technology (Graf and Vegvar 1987). The chip contained about 75 000 
transistors in an area 6.7 mm X 6.7 mm. Figure 9.43 shows the schematic of 
memory interface with the data input /output. There are 54 neuron-amplifier pairs 
used in the memory, each neuron-amplifier being a simple inverter. Data input 
and output are made through transmission gates TI and T2, respectively. Data 
are fed through a buffer containing one storage cell, each connected to the output 
neuron. The content of the buffer can be utilized to program synaptic connections 
using the approach shown in Figure 9.42. 

To initialize the memory output, the read/write transistor needs to con- 
duct in order to pass the signal to the neuron from the buffer. The memory 





9.5 ASSOCIATIVE MEMORY IMPLEMENTATIONS 


647 


Data in Data out 



matrix 

Figure 9.43 Neuron cell with buffer unit and controlling gates. [Adapted from Graf and Vegvar 
(1987); © IEEE; used with permission.] 


feedback loop remains open during the initialization. Following initialization, the 
run control signal is tumed to initiate recall, and the network evolves to an 
equilibrium State. After the circuit has reached a stable State, the read/write 
transistor passes the output voltage data to the buffer. These data can be read 
out as the components of the memory recalled binary vector. 

The fabricated network consists of 54 2 = 2916 binary weights designed 
according to the concept of Figure 9.42(a) (Graf and Vegvar 1987). Tests on the 
chip have been run with up to 10 vectors programmed as stable States. Auto- 
associative recall has been performed reliably, and only the States programmed 
into the circuit and no spurious States have been retrieved during experiments. A 
large spread of convergence times has been observed dependent on the overlap 
of the input and memory vector. The measured recall times have been between 
20 and 600 ns. In more general cases, the convergence time scale would depend 
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on the product of weight resistances times the intrinsic input capacitances of the 
neurons used. The power dissipation measured for the single convergence of the 
memory has been typically of the order of 200 mW, reaching 500 mW in the 
worst case. 

The drawback of single input line current summation used in this memory 
design is that excitatory and inhibitory currents may not be the same due to the 
mismatching between the sourcing and sinking circuitry. Since the mismatches 
are subsequently summed, this can produce a substantial error in the total input 
current value (Verleysen and Jespers 1991). To compensate for potential lack of 
symmetry due to the mixture of n- and p-type MOSFETs in weight circuitry as 
in Figure 9.42, a two-line system can be used instead. Such arrangements were 
already discussed in more detail in Section 9.3 for neural network architectures 
other than associative memories. 

In this technique, excitatory and inhibitory currents are summed on separate 
lines and weight currents are contributed to each of the lines by the same type 
transistors. Figure 9.44(a) illustrates a two-line current summation. Each synapse 
is a differential pair controlled by two RAM cells as shown in more detail in 
Figure 9.44(b) (Verleysen and Jespers 1991). If DataOutl = 1, current is sunk 
through M0, provided either Ml or M2 is conductive. Transistor Ml conducts 
and contributes to the excitatory current when Out is high. At the same time M2 
remains cut-off. The eontribution to the inhibitory current is generated when Out 
is low, thus M2 is conducting and Ml is cut-off. The current mirror Circuit needs 
to be used as described in previous sections in order to compare the excitatory 
and inhibitory currents and to produce the output voltage as a function of their 
difference. 

As suggested earlier, associative memories can also utilize the channel re- 
sistances of MOSFET devices as tunable weights to achieve weight setting with 
higher resolution. The concept of using properly biased field-effect transistors 
to implement weights was discussed in more detail in Section 9.2. Let us now 
look at an example of a hardware implementation of a versatile and expandable 
connection matrix chip reported by Moopenn et al. (1988). 

A single synaptic chip with 1024 weights has been implemented that con- 
tains an array of 32 X 32 “long channel” NMOS transistors as illustrated in Figure 
9.45. Since neurons are kept off-chip, the Circuit remains purely “synaptic” and is 
thus easily cascadable with identical chips. Such a multichip architecture allows 
for expansion of the chip up to the size of 512X512 weights, if chips are 
arranged in an array. In addition, the chips can be connected in parallel, thus al- 
lowing for higher than binary weight resolution. In this type of connection mode, 
the conductances are added since individual weights are connected in parallel. To 
ensure low-power dissipation within the connection matrix, weight resistances are 
selected from large values, thus taking small current levels. For this reason, large 
L/W (long channel) transistors are used on the chip (L = 244 /xm, W - 12 jim). 
This ensures typical channel resistance R ds values greater than 200 kft in the 
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Figure 9.44a Two-line current summation: (a) Circuit diagram. 


entire operating range [see formula (9.21)]. Also, a large ratio L/W enhances 
the linearity of the controlled weights, for a given range of operating voltages, 
thus it results in increased weight accuracy. This also increases the precision of 
implemented weights, which becomes of particular importance when connection 
matrices need to be connected in parallel in order to achieve better resolution. 

Figure 9.45(a) depicts the block diagram of the fabricated synaptic chip 
and also shows an expanded single weight cell. Transistor M2, controlled by the 
gate with voltage Q, acts as an on/off switch. The Q output is provided by the 
latch, which is extemally addressed via the row and column address decoders. 
Transistor Ml operates as a long channel resistance R ds controlled by the gate 
voltage V G . Since V G takes only one nonzero value in this Circuit arrangement, 
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Figure 9.44b Two-line current summation (continued): (b) implementation of binary weight. 


only binary unipolar weights can be produced in this Circuit. With a weight 
control voltage V G of 5 V, for example, values for R ds of about 200 kil can be 
obtained over an input operating range. The weights’ linearity of 5% has been 
measured for the input voltage in the range up to 1.5 V. 

Figure 9.45(b) shows the microphotograph of the 32 X 32 connection chip. 
The synaptic chip has been fabricated using 3 ixm bulk CMOS process. A mi- 
crocomputer interface needs to be used in this project to program the weight 
values, to initialize the memory, and to provide read-out of the recalled vector. 
The chip has been used as a basic building block for a number of associative 
memory experiments. An average convergence time within the connection chip 
has been observed of 10 [xs , with some recalls taking up to 50 fx s. The dis- 
cussed associative memory network has demonstrated an excellent stability and 
associative recall performance. 

Another approach to building a reconfigurable memory connection matrix 
is shown schematically in Figure 9.46. The weight value can be adjusted by 
tuming the binary weighted resistances on or off. The resistances are controlled 
by 1-bit RAM memory switches (Lee and Sheu 1991). The figure exemplifies 
weights that can be arranged extemally for up to 3-bit resolution. Resistances 
can be realized either as channel resistances of MOS transistors or as current 
sinks/sources connected to current mirrors. 
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Figure 9.45 Synaplic chip with long channel MOSFETs and binary weights: (a) chip architec- 
ture and single weight implementation and (b) microphotograph of a CMOS 3-/u.m chip with 1024 
weight cells. [© American Institute of Physics; adapted from Moopenn (1988) with permission.] 





ELECTRONIC 
NEURAL PROCESSORS 

In this chapter we have formulated the basic requirements and outlined the 
operational parameters of artificial neural hardware processors. We also discussed 
numerous issnes related to the design of special-purpose artificial neural systems. 
Much of the material presented to this point has been oriented toward the imple- 
mentation of the basic budding blocks of neural hardware. Synaptic connections, 
neuron elements, elementary current-mode processors, and scalar product circuits 
have been among the main budding blocks covered. Our emphasis on the analog 
neural processing has been due to the faet that it offers size and speed advantages 
compared to conventional numerical simulations. Analog neural processing is also 
much more efficient than purely digital hardware computation since convergence 
times for parallel analog hardware are significantly smaller. We have also seen 
that some electronic implementations can be made as combinations of analog 
and digital processors. While recall and often also leaming are carried out as 
analog processes, digital programmability and digital control and interface offer 
inereased accuracy and flexibility. 

This section discusses sample electronic neural network implementations. 
Rather than continue to focus on the principles and basic building blocks, here 
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we outline the main features of ready-to-use neural processors. Some of them 
appear experimental, some are available commercially — nevertheless, all of the 
discussed systems are reported to be complete and functional as described in 
the technical literature. Included in this section are Intel’s ETANN chip (Intel 
1991), the analog programmable neural processor (Fisher, Fujimoto, and Smith- 
son 1991), a general-purpose neural computer (Mueller et al., 1989, 1989b), and 
an optically controlled neural network (Frye et al. 1989). 

The Electrically Trainable Analog Neural Network (ETANN) 80170 inte- 
grated Circuit chip is a commercial product suitable for real-time neurocomput- 
ing applications. Its applications include recognition of printed or handwritten 
characters, diagnosis of fault pattems, signal recognition, image processing and 
convolution, speech recognition, nonlinear process control, robotic motion control 
and tactile pattem recognition. The chip operates in the recall phase with a speed 
of 2 • 10 9 CPS, which is well beyond the capabilities of both the neural network 
software and hardware accelerator boards. 

The ETANN 80170 chip is trainable in two modes. Given a specific map- 
ping task to be performed, synaptic weights can be trained for a selected neural 
network model. This task can be done off-chip either by the user’s software or by 
the leaming simulation software provided with the chip. Downloading the weights 
on the chip terminates this leaming performed “off-the-chip” on a programmable 
host neurocomputer. Also, off-chip leaming maximizes the flexibility of training 
approaches and of architectures potentially chosen for the network. 

Altematively, the chip can be subject to “chip-in-the-loop” leaming. In this 
mode of training, the network itself performs the recall function and is trained 
on-line, while software is used for recomputing the weights during training. This 
mode of leaming enables more realistic network development because the chip 
characteristics and imperfections are included and accounted for in each weight 
adaptation step. Because the chip uses EEPROM technology, the weights can be 
changed many times. The programmable weights are trainable with up to six bits 
of resolution. 

The ETANN chip comes in a 208-pin package. It consists of 64 pro- 
grammable neurons and a total of 10 240 programmable weights. The block 
diagram of the chip is shown in Figure 9.47(a). Each neuron is capable of com- 
puting the output for 128 input weights with variable inputs and 32 fixed-bias 
weights. Figure 9.47(b) illustrates programmable synapse operation. The weight 
value is programmed and stored as threshold voltages of a pair of floating-gate 
MOS transistors as described in detail in Section 9.4. The circuit shown in the 
figure performs the analog multiplication of the differential voltage Av in times the 
weight value, which, in tum, is proportional to the threshold voltage difference. 
The output from the synapse is in the form of the current difference A/ out . 

Families of the synaptic connections’ current characteristics shown in Figure 
9.48 illustrate the performance of programmable weights. Figure 9.48(a) shows 
the differential output current as a function of the weight input voltage with 
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Figure 9.47a ETANN 80170 chip: (a) block diagram. (Courtesy of Intel Corporation) 
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Figure 9.47b ETANN 80170 chip (continued): (b) synaptic connection implementation. (Cour- 
tesy of Intel Corporation) 


the threshold voltage difference V M , called here V^_ Diff as a parameter for each 
curve shown. The six different values of weights displayed in the figure have 
been obtained with the threshold voltage value of one of the transistors fixed as 
a reference, V ren . The graphs show that the weights can be set with fine accu- 
racy and a considerable linearity of weights is achieved. Figure 9.48(b) shows 
the differential input current versus the difference of the threshold voltages V M 
with the input voltage as a parameter for the family of curves. Noticeably, the 
inputs and outputs of the weight cell are both differential. Weight changes are 
accomplished by applying high-voltage pulses to the ETANN chip and causing 
the floating gate to store the weight value as described earlier in Section 9.4. 

Currents generated by products of up to 160 weights times the inputs are 
summed and converted to an output voltage. This is accomplished by the neuron 
characterized by an activation function as shown in Figure 9.49. The sigmoidal 
function takes a voltage-controlled shape. The neuron’s output voltage corre- 
sponds to a “squashed” scalar product of input and weight vectors, and it depends 
on the value of the control voltage VGAIN. 

The ETANN chip can be programmed and trained using a development 
system that plugs into a personal computer, which serves as a host. The develop- 
ment system contains converters, conditioners, multiplexers, and logic controllers, 
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Figure 9.48 Synapse differential output current characteristics: (a) versus differential input volt- 
age, V tM as a parameter and (b) versus V thd , differential input voltage as a parameter. (Courtesy 
of Intel Corporation) 
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sigmoid characteristics versus VGAIN at VREFO - 1.4V 



Figure 9.49 Neuron activation function versus differential input current, VGAIN voltage as a 
parameter. (Courtesy of Intel Corporation) 


which are needed to program the neural network chip. System software is avail- 
able to either verify the downloaded weights or read weights out of a chip that 
has been previously trained in an on-line mode. The ETANN chip can be wire- 
wrapped into a variety of architec tures. Either cascaded or parallel architectures 
can be developed that employ up to 1024 neurons and up to 81 920 weights in 
the interconnected network. 

The analog neural network processor described by Fisher, Fujimoto, and 
Smithson (1991) and discussed below represents a somewhat different category 
of neural hardware. In contrast to the ETANN concept, the processor can be re- 
programmed on the fly as if it were read/write memory. The processor ’s analog 
throughput does not need to be halted during the weight adaptation or reprogram- 
ming. Thus, while ETANN is best suited for ultrafast execution of problems that 
require rather infrequent retraining, the analog neural network processor can be 
modified while working. In addition, the ETANN chip uses current-mode signal 
processing rather than the voltage-mode processing employed in an analog neural 
network processor. 

The processor can be connected so it functions as either a feedforward 
or feedback network. The 256 neurons are made of operational amplifiers. The 
processor weights are arranged in an array of custom programmable resistor chips 
assembled at the board level. The 2048 programmable resistors are physically 



658 


NEURAL NETWORKS IMPLEMENTATION 


located on 16 boards. The overall architecture of the processor is shown in Figure 
9.50(a). Input to the network can be simply analog, or from 12-bit digital sources 
converting the digital signal to analog form in the block of 256 digital-to-analog 
(D/A) converters. 

The experimentally tested D/A con version rate for the processor built has 
been 5 kHz. The processor input data can be provided extemally or from the 
computer. The analog output data can be monitored after a conversion of up to 
256 outputs by a set of 256 A/D converters. The interconnection boards used by 
the processor adjust it to the problem class and are architecture specific, while the 
rest of the processor is general-purpose. While the system requires more space 
than an integrated neural network chip, the use of the boards and wire-wrap 
interconnect boards makes it possible to reconfigure the network easily. 

The heart of the analog network of the processor is made up of the digitally 
programmable analog resistors. A single programmable weight diagram is shown 
in Figure 9.50(b). The resistors are made on a gate array chip. This weight 
implementation concept is similar to that described in Section 9.2 for MDACs 
with binary weighted resistances. This processor, however, uses integrated Circuit 
resistors rather than MOSFETs. Each weight consists of a buffer connected to 
five polysilicon resistors through analog bipolar switches. The resistors range 
from 8 to 128 kfl and can be set up with an 8- kfl step, thus providing 4-bit 
resolution. This resolution has been found adequate for the neural Control appli- 
cations studied by Fisher, Fujimoto, and Smithson (1991). The buffered design 
allows for the connection of many resistors to the neuron’s input without placing 
excessive current load on the summing and amplifying neuron. Also, since the 
weight program data are stored in digital form within computer memory, or 
registers, no refresh signals are required to store their values. 

The programmable neural network processor has been successfully tested 
for control of adaptive mirrors’ actuators. The purpose of adaptive mirrors is 
to undo the distortions of images of solar or celestial objects. Such images are 
distorted when the light forming the image passes through the atmosphere. The 
light reflected from adaptive mirrors can produce corrected images. However, 
the mirror adaptation cycle must not take longer than 10 ms to keep up with 
atmospheric changes. The neural processor described has been able to control the 
mirrors significantly faster than any of the conventional digital control methods 
attempted. 

Another example of an electronic neural network is described by Mueller 
et al. (1989a, 1989b). The network is designed and implemented in the form of 
a general-purpose analog neural computer. It consists of three types of intercon- 
nected modules. They are dedicated VLSI chips: neurons, modifiable synapses, 
and routing switches. The neurocomputer runs entirely in the analog mode but the 
architecture, weights, neuron parameters, and time constants are all set digitally 
from a host computer. 
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Figure 9.50 Programmable analog neural network processor: (a) block diagram and (b) pro- 
grammable resistor weight schematic. [© IEEE; part (a) reprinted, and part (b) adapted, trom 
Fisher, Fujimoto, and Smithson (1991), with permission.] 
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Figure 9.51 Top view of the photosynapse. 


An interesting feature of this neurocomputer is that the analog outputs of 
the neurons are read through analog multiplexers. Subsequently, these data are 
digitized and stored in digital memory. During training, the weights and threshold 
values of the neurons are computed by the host computer and set on the neuron 
and synaptic chips. The multiplexing operation does not interfere with the actual 
analog computation. 

The synaptic weights are implemented by current mirrors that scale the 
currents with 5-bit resolution. Since the neurons’ outputs are voltages, voltage- 
to-current converters have been used to provide appropriate current inputs to the 
current mirrors. The converters are followed by the current dividers consisting of 
current mirrors in series. To retain the weight value, synaptic cells are provided 
with local 6-bit weight control memory. Five bits determine the weight magnitude 
and the sixth bit is used to determine the sign of the weight. The memory consists 
of shift registers that read the data during the programming phase. 

The neuron Circuit consists of a rectified summing amplifier, comparator, 
and an output driver. Inputs to the neurons are currents; outputs are analog 
voltages. Digital control can be used to adjust the following neuron parameters: 
bias (threshold), minimum output, and the choice of a linear or sigmoidal transfer 
function. The sigmoidal transfer function is obtained by adding one or more non- 
linear devices in parallel with the feedback resistor of the summing operational 
amplifier. The neuron chip fabricated consisted of 16 neurons implemented in 
2- fim CMOS technology. The reader is encouraged to refer to the literature for 
more detail on this versatile circuit (Mueller et al. 1989a, 1989b). 

An interesting example of a neural network with optically controlled hard- 
ware has been described by Frye et al. (1989). Synaptic weights can be built 
using amorphous Silicon photoconductors. Such weights, which can be called 
“photosynapses,” react to the amount of light energy projected onto them. A 
schematic illustration of a photosynapse (top view) is shown in Figure 9.51. The 
photosynapse is composed of three narrow strips of metal deposited on a layer 
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Figure 9.52 Weight conductance (normalized) versus programmed value for the photosy- 
napse: (a) image iength modulation and (b) image intensity modulation, ful! bar length. [© IEEE; 
reprinted from Frye et al. (1989); with permission.] 


of glow-discharge amorphous Silicon. The region between input and either the 
negative or positive output line forms a photoconductor whose electrical con- 
ductance can be controlled by an image projected on the synapse. The weight 
conductance is a function of the length of the bar of light projected and also of 
its intensity. In addition, the length of the bar is proportional to the length of 
the image. 

The image projected on the matrix of synaptic connections has been gener- 
ated by a high-intensity CRT. The length of the image bar spanned 240 pixels. The 
weight conductance has been coded accordingly into the integer values between 
0 and 240. Altematively, the intensity of light alone has been used to Control 
the conductance encoded as an integer ranging from 0 to 255. This intensity has 
been projected on the photosynapse when using a full-length bar of 240 pixels. 

Measured values for the normalized weight conductance are shown in Figure 
9.52. It can be seen that the conductance modulated by the length of the bar at full 
intensity displays a linear relationship. The behavior of weight as a function of 
bar intensity, however, shows a very different functional dependence. Because of 
the nature of the CRT light and of the properties of the photoconducting material 
employed to make the photosynapse, the intensity-modulated conductance is a 
strongly nonlinear function of the programmed value, showing only a small value 
of conductance for programmed intensity values below 200 (Frye et al. 1989). 
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Figure 9.53 Optically controlled multilayer feedforward network: (a) neuron Circuit diagram and 
(b) block diagram of the system. [Adapted from Frye et al. (1989); © IEEE; with permission.] 
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As shown in Figure 9.51, the weight current can be either excitatory when 
/+ > 0, or inhibitory when /__ > 0. A simple neuron circuit performing an 
aggregation of currents £/+ and £/_ is shown in Figure 9.53(a). The first 
stage consists of current-to-voltage converters, or transconductance amplifiers 
built with operational amplifiers O Al and OA2. A unity-gain inverter built with 
OAi provides the sign change for the voltage but without scaling of its magni- 
tude. An inverting summer mode with OA4 provides for the summation of the 
voltages of both channels. The two diodes in the feedback loop are provided to 
give the neuron sigmoidally-shaped response characteristics rather than piecewise 
linear-shaped characteristics. 

The described network with photosynaptic connections has been used in a 
multilayer feedforward architecture trained in the error back-propagation mode. 
The block diagram of the overall error back-propagation training system is illus- 
trated in Figure 9.53(b). The extemal computer is used to generate the training 
data and to evaluate the error produced by the actual system with the current 
setting of weights. Then, the new images are produced and sent to the photosy- 
napses that correspond to the newly adjusted weight values. The output of the 
neural system and of the hidden neurons is monitored in each training step so that 
the new set of weights can be recomputed and translated into light images. The 
described network has been successfully trained to locate the target point of a 
ballistic trajectory. The system has proven very useful despite some disadvantages 
displayed by the actual hardware. 


9.7 in i. 

CONCLUDING REMARKS 

Throughout this book we have seen that a wide variety of problems can 
be solved by using the neural network computing framework. However, each 
of the classes of problems that needs to be solved usually requires a different 
architecture and training approach as well as different data stored in the network 
in the form of weights. The exposition in this chapter has focused on building 
working models of neural net works which we called neural processors. Among 
the numerous approaches to implementation of artificial neural systems we have 
stressed microelectronic hardware realizations. We have emphasized analog cir- 
cuits to implement neural systems because they provide high synapse density 
and high computational speed (Satyanarayana, Tsividis, and Graf 1990). Once 
the algorithms are developed and the networks trained, such realizations provide 
an efficient way of solving a variety of technical problems. In most cases, neural 
analog processors offer improvement over both software simulations and digital 
circuits emulating analog operation. 
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Most real-life applications of artificial neural networks appear to typically 
require at least several hundred neurons. An essential aspect of neural hard- 
ware implementation is the mechanism used to represent and manipulate the 
weights. Although more difficult to accomplish than weight adjustment, leaming 
also appears to be an essential ingredient for most neural network applications. 
A number of designs to accomplish these features within the neural network 
hardware rather than on the host computer have been presented in this chapter. 

While some approaches use off-line leaming to train an artificial neural 
system, which is both possible and practical, the inherent ability to leam on the 
chip as more information becomes available to the system is invaluable. However, 
the development of an artificial neural system chip with on-chip leaming seems 
to be a nontrivial task. Even for limited leaming capability, the network design 
becomes quite complex (Collins and Penz 1990). Development of standard dig- 
ital application-specific integrated Circuit hardware requires high expenditures to 
design, fabricate, and test the digital Circuit chip. Development of neural network 
hardware seems to be even more costly. This is one of the reasons why the neu- 
ral network chips currently available often involve technology demonstrations 
as opposed to the easily usable, perhaps even hand-held circuits for practical 
applications. The task of making a universal and usable neural network chip still 
remains a challenging one. 

Let us consider what could be called an “ideal” neural network. Such a 
network should possess the folio wing characteristics: 

■ Work for any neural processing algorithm 

■ Contain at least 1000 neurons, which can be interconnected globally 

■ Have programmable analog weights 

■ Be able to leam on-chip 

b Consist of small-area neurons and interconnections 

■ Operate at low-power levels 

■ Be stable, reproducible, and extendable so that larger systems can be 
built through interconnecting neural network building blocks 

■ Be affordable. 

Let us compare the above requirements with the present capabilities of integrated 
Circuit technology. One of the most complex semiconductor chips currently avail- 
able (early 1992) is the 1-Mbit DRAM. One can say that a 1-Mbit DRAM is 
roughly equi valent to a 1000-neuron network, which involves about one million 
weights. The DRAM network, however, stores binary information in contrast to 
the expected analog weight storage of the neural hardware. Therefore, just from 
the standpoint of granularity of data storage, it is much easier to fabricate a 1- 
Mbit DRAM chip than a 1000 neuron network chip. Additional expectations such 
as weight tunability, leaming capability and required nonlinear processing by the 
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neuron bring even more complexity to the neural hardware picture (Collins and 
Penz 1989). 

Neural network chips available at present, including those discussed in this 
chapter, meet somewhat relaxed criteria compared to the above list of desirable 
features and are of reduced complexity and size. The chips made thus far contain 
on the order of 100 processing nodes. We can State that the currently available 
neural hardware, including VLSI chips, still lags somewhat behind the neural net- 
works’ theoretical computational capabilities. Therefore, most large-scale neural 
computation is still done on serial computers, special-purpose accelerator boards, 
or by theoretical analysis. 

Out of necessity, the exposition of neural network implementation in this 
chapter has been limited to a number of approaches and design techniques. 
Also, microelectronic VLSI technology has been emphasized for the most part. 
Other designs and approaches are available and are described in more special- 
ized technical reports. The reader may refer to the literature for further study of 
the subject. 

Below is a partial list of topics and related literature: 

■ Neurocomputing architectures (Bessiere et al. 1991; Goser et al. 1989; 
Jutten, Guerin, and Herault 1990; Morton 1991; Ouali, Saucier, and 
Thrile 1991; Ramacher 1991; Vidal 1988) 

■ Architecture of neural VLSI chips (Wasaki, Itario, and Nakamura 1990; 
Schwartz and Samalam 1990; Foo, Anderson, and Takefuji; 1990; Kaul 
et al. 1990; Mack et al. 1988; Mann 1990; Lambe, Moopenn, and 
Thakoor 1988) 

■ Silicon implementation of neural networks (Maher et al. 1989; Murray 
1991; Paulos and Hollis 1988; Reed and Geiger 1990; Rossetto et al. 
1989; Salam and Choi 1990; Verleysen and Jespers 1989; Wang and 
Salam 1990) 

■ Optoelectronic implementations of neural networks (Farhat 1989; Lu et 
al. 1989; Psaltis et al. 1989; Szu 1991; Wagner and Psaltis 1987). 

In addition, a pioneering monograph on neurobiologically inspired neural 
networks is available (Mead 1989). The book is mainly devoted to microelec- 
tronic hardware emulation of a number of the powerful organizing principles 
found in biological neural systems. Integrated Circuit design techniques are em- 
ployed in the book to implement certain features of biological networks. Ex- 
amples include imitation of human senses such as hearing, vision, and other 
perceptions. Most of the circuits developed by Mead (1989) operate in the 
subthreshold range of transistors. MOS transistors in the subthreshold range 
have beneficial properties such as extremely low power consumption and the 
transistor ’s current versus voltage characteristics is described with exponential 
relationships rather than quadratic with additional saturation. 


666 


NEURAL NETWORKS IMPLEMENTATION 


As stated before, efficient neural processing hardware is crucially important 
for the success of neural network technology if we are going to take full ad- 
vantage of its leaming and processing capabilities. There is strong and growing 
activity in this area. Analog semiconductor technology, charge-coupled devices, 
and optical implementation are all among the explored technologies. Although 
semiconductor technology is presently the most widespread, the capabilities of the 
other technologies have not yet been fully evaluated for efficient implementation 
of neural network hardware. 


PROBLEMS 

Piease note that problems highlighted with an asterisk (*) are typically com- 
putationally intensive and the use of programs is advisable to solve them. 

P9.1 Analyze the neuron circuit from Figure P9.1 and compute its weight val- 
ues. Compute the neuron ’s respons e/(x) for the following inputs knowing 
that / sat+ = ~/ sat _ = 13 V: 

(a) x = [-1 3 0.5]' 

(b) x= [0 0.5 1.5]' 

(c) x= [-1 -2 -0.5]' 

(The input vector components are specified in Volts.) 

P9.2 Design the neuron circuit as in Figure 9.5 with weights of values —3, 
—4, and —6. Use the feedback resistance value of 1200 ft. By writing 
appropriate inequalities, determine in which segments of the input space, 
x , the neuron saturates and assumes the values / sat+ or / sat _. Assume 
/sa, + = "/sat- = 13 V. " 


12511 100011 



Figure P9.1 Neuron circuit for Problem P9.1. 
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Figure P9.3 Resistor-operational amplifier implementation of neuron with weights for Prob- 
lem P9.3. 


P93 


The neuron circuit shown in Figure P9.3 implements the following map- 
ping: 


/(x,y)=[wj w 2 w 3 ] 


+ [Vj 

v 2 v 3 ] 

y\ 

x 2 

y2 


.*3. 





(a) Find weights w 2 , w 3 and v x , v 2 , v 3 . 

(b) Assume that input to this single neuron network is now the vector 
with entries being voltages 


«=[*i xi x 3 y\ yi yz\* 

such that (1) | { a 1 1 = 1 and (2) the input exactly matches the total 
weight vector defined as 


W tot = [Wi W 2 W 3 V, V 2 V 3 ]' 

so that w[ ot a = ||w tot ||. Find the neuron’s output voltage /(a, w tot ) due 
to a. 

P9.4 The operational amplifier-resistor neuron circuit from Figure P9.4(a) needs 
to be analyzed. Compute f o2 as a function of w l9 w 2 , and w 3 , and of the 
bias T as indicated on the neuron symbol of Figure 9.4(b). Assume for 
computation purposes that the two neurons are equivalent. 
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(b) 


Figure P9.4 High-gain neuron with weights and nonzero threshold: (a) Circuit diagram and 
(b) symbol. 

(a) Draw the neuron’s transfer characteristics as a function of z=X;'_ , 

(b) Determine the range of activation z that allows for the linear operation 
of the neuron. Assume |/ sat _ | = |/ sat+ 1 = 14 V. 

(c) Assume that the neuron needs to classify pattems of two classes and 
find its responses due to the eight individual input values, which are 
vertices of a three-dimensional [-1, 1] cube. 

P9.5 The perceptron shown in Figure P9.5 needs to classify input pattems into 

two classes. The centers of gravity for each class are known as follows: 
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Figure P9.5 Perceptron for classification task of Problem P9.5. 


Design the minimum-distance dichotomizer that would best classify pat- 
terns from the two classes. Follow the design steps below: 

(a) Compute the weights and threshold value of the bipolar binary per- 
ceptron. 

(b) Find R u R 2 , R 3 , and R 4 for the high-gain perceptron from Figure 9.5 
that approximates the bipolar binary perceptron. 

(c) Identify the region of linear response of the designed perceptron. This 
region contains inputs that are falling between the two distinet classes 
and produce response in the range between / sat „ and / sat +. Assume 
for computations that |/ sat _| = |/ sat +| = 14 V. 

P9.6 The enhancement-mode NMOS transistor shown in Figure P9.6 is config- 
ured to funetion as a voltage-controlled synaptic weight. Voltage V G = V gs 
is the weight control voltage. It sets the drain-to-source, or channel, resis- 
tance R ds that determines the weight current, I ds . Knowing that k' — 20 
jjlA/Y 2 , W/L = 10, and V th = 1.5 V, evaluate the value of R ds as follows: 

(a) Compute and sketch R ds as a funetion of V gs for 1.5V<^<5V. 
Assume that the channel resistance is linear because of linearization 


Weight 

input 





’ Virtual 
ground 
node’ 



Figure P9.6 NMOS transistor as electrically controlled weight resistance. 
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(b) 


of I ds (V ds ) due to the limited input levels, which fulfill condition 
Kb < 0.5(V gs - V th ). 

Compute and sketch R ds = V ds 1 1 ds using the folio wing unabbreviated 
formula for I ds : 


1 ds 




Solve this part of the problem for the condition V ds = 0.5(V^ 5 — V th ). 
Again, V gs should be maintained in the range from 1.5 V to 5 V. 
Notice the increase of the computed resistance value compared to 
case (a) due to the quadratic term now present in the expression for 

^ ds * 

P9.7 The scalar product circuit with electrically tunable weights as illus- 
trated in Figure 9.11 uses R F = 10 kil and NMOS transistors with 
k f ~ 20 /xA/V 2 , W/L = 10, and V th = 2 V. The circuit needs to 
implement the function 


v 0 — —2.5*! — 6x 2 — 4x 3 
Complete the design of the circuit by computing 

(a) the gate control voltages V gsl , V gs2 , and producing the spec- 
ified scalar product 

(b) the range of input voltages x u x 2 , and x 3 ensuring the linearity 
of the channel resistance. Use the linearity condition V dsi = V ( < 
0.5(V^ 2), for i = 1, 2, 3. 


P9.8 (a) Two 3-bit MDACs are connected in a circuit as shown in Figure 
P9.8(a). Calculate the output voltage v ol assuming that x x and x 2 
are input voltages. 

(b) Calculate the output voltage v o2 for the input voltages y\ and y 2 
by analyzing the circuit shown in Figure P9.8(b). 

(c) A difference voltage amplifier is used to combine excitatory volt- 
age v ol [from part (a)] with inhibitory voltage v o2 [from part (b)] 
as shown in Figure P9.8(c). Find 

v„ 3 ( w ) = i w i w 2 ^3 w 4 ] [x, x 2 y x y 2 ]‘ 

for this neuron circuit when all operational amplifiers operate in 
their linear regions. 


P9.9 A linear activation function neuron needs to be designed using three 
weighted-resistor MDACs as shown in Figure 9.15(a). MDACs are 
controlled by 4-bit words. The neuron ’s output voltage needs to be 


= -8v,i - 4.5 v,- 2 - 2.5v,- 3 
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Figure P9.8a Neuron Circuit using 3-bit MDACs: (a) positive weight Circuit. 


provided the operational amplifier operates in the linear region. Select 
R F = 4.5/?, where R is the resistance determining the MSB switch 
current, or S 4 current. Calculate the control word settings D x , D 2 , and 
D 3 that best approximate the required output. Upon completion of the 
design, find: 

(a) the percent errors in approximation of the specified weight coef- 
ficients —8, -4.5, and -2.5 through control words D x , D 2 , and 
Z) 3 , respectively 

(b) the output voltage for v n = v i2 = v /3 = 100 mV with control 
word settings as in part (a) 

(c) the percent error of the output voltage computed in part (b) as 
compared with the accurate output voltage required. 
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P9.10 The circuit configuration shown in Figure P9.10 is used to produce 
a single digitally programmable weight. Weighted currents in vertical 
branches of the MDAC are obtained by proper sizing of enhancement 
mode NMOS transistors MA through Ml. Each of the transistors is 
assumed conductive and behaves as an integrated circuit linear re- 
sistance between the drain and source, R ds , for V gs = 4 V pro vided 
0 < v in < 1 V The threshold voltage of transistors used is V th = 1 
V, the process transconductance is k f = 20 fx A/V 2 . 

Transistor current in the resistive region is approximated by the 
expression 

hs = k'™(V gs - V th )V ds for ^ - V th 

The following design conditions need to be assumed: 

(1) ^in,max “IV 

(2) /-max = 1.S75 fi A when b l9 b 2 , b 3 , b 4 are of unity value and S 1? 
5 2 , 5 3 , and S 4 are short-circuited 

(3) V gs = AV 
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Complete the design by computing required value W IL of transistors 
Ml through M4 used to produce the weight. 

P9.ll Consider the current mirror as in Figure 9.21 with V ss = 0 V, / in = 
10 jjlA and assume transistors Ml and M2 to be identical, with V th = 
1 V, k' = 20 n A/V 2 , L = 10 and W = 40 /xm. Find V gsl = ^ s2 
and the lowest allowable output voltage v out at the present current 
level I = 10 fxA for proper operation of the current mirror. 

P9.12 The basic differential amplifier as shown in Figure 9.27(a) with dif- 
ferential input and single-ended output and resistive loads R L has 
voltage transfer characteristics as shown in Figure 9.28(a). For this 
case, compute the voltage gains A vl , and A v2 defined as 

\ dx indJ v ind -Q 


A V 2 


f d y d2 \ 

\ dv indJ v ind ~Q 


Express computed gains in terms of k, and I ss . 

P9.13 Design a template matching network using the concept of one tran- 
sistor per single bit matching element as illustrated in Figure 9.32(b). 
The input vector representing the template bit map to be matched is 
x = [ 1 0 X 1 0 0]*, where X denotes a “do not care” bit. 

P9.14 The double-ended operational amplifier shown in Figure P9.14 that 
performs as a neuron element implements simultaneously summa- 
tion and current-to-voltage conversion. It uses two matching depletion 
mode transistors with device transconductance and depletion voltage 
values k and T dep , respectively. Show that both networks of Figure 
P9.14 are identical if (9.68) holds, or 



P9.15 A simple way to reduce the nonlinearity of a weight consisting of 
an MOS transistor is to use a properly biased transmission gate as 
shown in Figure P9.15. Show that if both transmission gate transis- 
tors remain in the resistive region (NMOS: V ds < V gs — V thn , PMOS: 
V* > — V thp ), the resistance of the produced weight is equal 

A V, ~y 2 = 1 

I 2k(V g - V thn ) 
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Neuron ’s output 


Figure P9.14 Integrated Circuit feedback resistance for current summation and current-to- 
voltage conversion for Problem P9.14. 


Assume: 



(V gs - V thn )V ds - Y&. 
(Vgs - VthpWds - Y 


for NMOS device 


for PMOS device 


k k n kp , V t h V t h n V t hp 

{Hint: Assume initially V 1 > V 2 for derivation purposes. Since the 
weight circuitry is electrically symmetrical, the weight resistances 
remain identical when V x < V 2 is subsequently assumed.) 
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Figure P9.15 Integrated Circuit resistance using a transmission gate. 


P9.16 The output stage of a neuron with the sigmoidal transfer character- 
istics can be designed with two diodes connected in parallel to the 
operational amplifier feedback resistor. The diodes cause the approxi- 
mate saturation voltages of the neuron, which are substantially lower 
than the saturation voltages produced without the diodes in place. 
The resulting current-to-voltage converter that needs to be analyzed 
in this problem is shown in Figure P9.16(a). Assume that Dl and 
D2 are identical and described by the model approximated as shown 
in the box of Figure P9.16(b). Provide the answer to the problem 
in the form of the graphical characteristics of v 0 versus I for the 
current-to-voltage converter. 



p 

<0.8V 

+ 

i “ 

+ 0.8V - 
otherwise 


(b) 


Figure P9.16 Current-to-voltage converter with nonlinear feedback for saturation level control. 
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Tabulated data 
*d( v d) 

v 0 (V) 

i D (A) 

<0 

0 

0.1 

0 

0.2 

10" 6 

0.3 

10" 5 

0.4 

10“ 4 

0.5 

10' 3 

0.6 

10“ 2 

0.7 

10- 1 

0.8 

1 



Figure P9.17 The current versus voltage characteristics of diodes Dl and D2 for Problem 
P9.17. Values tabulated should be used for the piecewise approximation of the curve shown. 


P9.17 Repeat Problem P9.16 after replacing diodes Dl and D2 with the 
diode model characteristics sketched in Figure P9.17. The piecewise 
linear characteristics of the diodes as provided in the table can be 
used for obtaining the graphical solution. 

[Hint: The parallel connection of R F , Dl, and D2 may be treated as 
a single nonlinear resistor carrying current / tot . Current versus volt- 
age characteristics i to t (v Q ) of this composite nonlinear resistor can be 
obtained graphically by adding the currents through R F , Dl, and D2. 
The current-to- voltage conversion curve v 0 (I) can then be found since 

hot = 

P9.18 An analog voltage multiplier has the following transfer characteris- 
tics: 


V 0 = cv-y m , where V thd > 0 

where v,-, and v Q are input and output voltages, respectively, V thd is the 
difference in threshold voltages of the MOSFET devices used, and 
c > 0 is a constant. Measurements of the multiplier output voltage 
versus its input voltage have been taken and are illustrated in Figure 
P9.18. Based on the characteristics shown: 

(a) Compute the multiplication constant c of this multiplier. 

(b) Draw the characteristics v Q versus V thd with the input voltage as 
a parameter taking values v t = 0; 0.25; 0.5; 0.75; 1.0 ( V). 

P9.19 Analyze the neuron Circuit shown in Figure 9.53(a) and obtain its 
transfer characteristics v 0 versus the difference of total excitatory and 



678 


NEURAL NETWORKS IMPLEMENTATION 



Figure P9.18 Characteristics of an analog voltage multiplier for Problem P9.18. 


inhibitory currents as follows: 

v. =/(I/ + -S/_) 

Assume that the Circuit components have the values R F = 10 kfl, 
R x = 1 kil, and R 2 = 5 kfl. Assume also that all operational am- 
plifiers operate in their linear regions. Solve the problem for two 
separate cases as follows: 

(a) Diodes Dl and D2 are disconnected. 

(b) Diodes Dl and D2 are connected as shown and modeled by the 
equivalent Circuit as depicted in Figure P9.16(b) 

P9.20 This problem requires the analysis of the template matching network 
shown in Figure 9.32(b), for which n denotes the number of bits equal 
to 1 in the original template vector to be matched. Assume that the bit 
transistor resistances are R , and that the bias transistor resistance is 
chosen to have the value R/(n- 1). Show that the following properties 
of the network apply for this choice of component values: 

(a) v in = V DD n/(2n - 1) > /2 for a perfect match 

(b) Vj n = V dd /2 for the mismatch of a single input bit originally “0” 

( c ) v ’in = Vdd / 2 for the mismatch of a single input bit originally “1” 
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T his appendix gives a brief introduction to basic mathematical concepts used 
throughout this text. It also contains the listing of the main procedures of 
the ANS program. Reading this appendix should give the student a good under- 
standing of most of the definitions, interpretations, and operations. Reviewing the 
listing of the source code would provide the reader with an understanding of the 
programming of neural network computing algorithms in the Pascal language. 
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VECTORS AND MATRICES 


Definition of Vectors 

Consider n real numbers (scalars) x x , x 2 , . .., x n . These numbers can be 
arranged so as to define a new object x, called a column vector: 

r*ii 


Al 
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A lowercase boldface letter will always denote a column vector in this text, for 
example, x, y, w. The numbers x x , x 2 , are the components of the vector. 

The vector itself can be called an n-component, n-tuple, or n-dimensional vector. 

We can associate with the x vector its transpose vector ; denoted by x l and 
defined as follows: 

*' =[*i x 2 ... x„] (Al. 2) 

where superscript t denotes the transposition. Note that x' is a row vector. 


Operations on Vectors 

Consider the column vectors 



V 


>r 


'x, 

+ yi 

X = 

x 2 

. y = 

y 2 

A 

, z = 

x 2 

+ J2 


-Xn- 


.y«. 


-X„ 

+ • 
1 


(Al .3) 


We define the sum of the column vectors x and y, x + y, to be the column vector 
z as in (Al. 3). Note that adding two n-component vectors requires adding their 
respective components. 

The product ofa column vector x by a scalar c is defined as a vector whose 
components are products cx { such that 



V 


" CXy " 

A 

x 2 


cx 2 

X 

II 

; 


: 


.x n . 




Linear Dependence of Vectors 

Consider a set of n-dimensional vectors x l9 x 2 , . x m used to construct a 
vector x such that 

m 

X = C]X, + c 2 x 2 + . . . + c m \ m = s c.-Xf (Al .5) 

1 = 1 

where c l9 c 2 , . . . , c m are scalars. The vector x is a linear combination of vectors 
Xp for i = 1, 2, . . . , m. A set of vectors x 1? x 2 , . . . , x m is said to be linearly 
dependent if numbers c {9 c 2 , . . . , c m , exist and are not all zero, such that 

c,x, + c 2 x 2 + . . . + c m \ m = 0 (Al .6) 

If vectors x 1? x 2 , . . . , x m are not linearly dependent, we say that they are linearly 
independent. 
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EXAMPLE A1.1 


Consider three vectors 



' f 


' 0‘ 


r 21 

X 1 = 

-1 

0 

, x 2 = 

-2 

1. 

. x 3 = 

1 

cn 

1 

1 


Since 2x, + (-3)x 2 + (-l)x 3 =0, the vectors are linearly dependent. 
Consider three vectors 



V 


'0' 


'0‘ 

= 

0 

0 

. X 2 = 

1 

0 

1 x 3 = 

0 

1 


No constants, c,, i = 1, 2, 3, which are not all zero, can be found such that 
CjXj + c 2 x 2 + c 3 x 3 = 0. Therefore, vectors x 1? x 2 , and x 3 are considered to 
be linearly independent. ■■ 


Definition of Matrices 


A rectangular array A as in (Al .7) of n-m numbers a tJ , where i = 1 , 2, . . . , n, 
and j — 1,2 , m, arranged in n rows and m columns is called an nXm matrix 


a 11 

a \ 2 • 

^1 m 

a 2\ 

a 22 - 

®2m 

a n\ 

a n2 ' 

&nm 


(Al .7) 


We denote matrices by uppercase boldface letters, for example, A, W, V. 

We can associate with the A matrix its transpose matrix , denoted by A r and 
defined as follows: 


a u 

«2l ‘ 

a nl 

a \2 

a 22 ' 

' * a n2 

^1 m 

a 2m ' 

a nm 


(Al -8) 


Note that A l is formed from the matrix A by interchanging its rows with columns. 
Therefore, the first column of A becomes the first row of A r , the second column 
of A becomes the second row of A r , and so on. Thus, is an m X n matrix. 
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Operations on Matrices 

Consider two nX m matrices, A and B, with their respective elements a ^ 
and b ( j , for i = 1, 2 and j = 1, 2, . . . , m. The swn of matrices A and B 
is the w X m matrix whose elements are the sums of the corresponding elements: 


A + B 


a n a l2 * • • a lm 


b u b l2 • • ■ b lm 

a 2l a 22 ‘ * a 2 m 

+ 

l>2\ b 22 ' ' ‘ b 2m 

- a n\ a n2 *** a nm . 


. b n \ b„2 • b nm . 

a n + b n a 12 + b 12 

a \m + b Xm ' 

a 2\ ^21 a 22 ^22 

' ‘ ‘ a 2m + b2m 

- a n\+ b nl a n2+ b n2 

® nm + b nm . 


(Al .9) 


The product of an n X m matrix A by a scalar c is defined as an n X m 
matrix whose elements are products ca Vj such that 


(Al. 10) 


The product AB of an n X m matrix A and of an m X q matrix B is the 
nXq matrix C such that 



a u 

a \2 ' ' 

@lm 


ca n 

ca n ’ 

ca^m 

cA = c 

a 2\ 

a 22 ' ’ 

* a 2m 

= 

Ca 2\ 

ca 22 • 

* * ca 2m 


a n 1 

a n2 ' ' 

®nm 


ca n \ 

™ n 2 ' 

c ®nm 


c ik ” 5L a ijbjk 

y= i 


(A1.11) 


for i = 1, 2, . n, and k = 1, 2, In matrix notation, the product is 

written 

C = AB (Al. 12) 

Formula (Al. 11) indicates that the element c ik at the intersection of the V th row 
and Æ’th column of the product matrix is computed by multiplying the /’th row 
of A by the k ' th column of B. Note that for the product AB to make sense, the 
number of columns of A must be equal to the number of rows of B [both are 
m in expression (Al. 11)]. The reader should notice that matrix multiplication is 
not commutative. 
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EXAMPLE 


Al .2 


Let us illustrate the use of (Al. 11) by multiplying two matrices as shown 
below: 


‘<*11 

a l2 

,B = 

a 2l 

a 22 

.<*31 

<*32. 



*12 

* 22 . 


Matrix A is 3 X 2, matrix B is 2 X 2, thus the product matrix is 3 X 2: 


AB = 

a n 

a 2\ 

<2,2 
a 22 

’*n 

.*21 

*12 

— 

a u b n + < 2 , 2^1 
<*21 *11 + <*22*21 

<* 11*12 + <* 12*22 
<*21*12 <*22*22 


, a 3l 

a 32. 

°22\ 


.<* 3 i*n + a 22 b 2l 

<*31*12 + <*32*22. 


Note that, for example, element c 12 of C computed from (Al. 11) is 


2 

c 12 = X <*1 jbj2 = <*11*12 + <*12*22 

j= 1 

and the element c 2X computed from (Al. 11) is 

2 

C 21 = X a 2jbjl = <* 21*11 + <* 22*21 

7=1 


Multiplication of Matrices and Vectors 


An w-tuple vector x defined in (A 1.1) can be interpreted as a matrix with m 
rows and one column. Let us consider the problem of postmultiplying an n X m 
A matrix with a vector x, which is equivalent to an m X 1 matrix. If we let 


y = Ax 

or, in an expanded form 


>1 

y 2 

_ 

<*11 <*12 ' 
<*21 <*22 ‘ 

' <*lm 
‘ <*2 m 


- * Jf 

1 

'c . 
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-<*nl <*«2 " 

<* nm - 


, 

L-^m J 


(Al. 13a) 


(Al. 13b) 


then we can compute components y ; of the vector y using (Al.ll) as follows: 


yt 


S OijXj, 
7=1 


for i = 1,2,...,« 


(Al. 14) 


Note that the vector y resulting as a product is n-dimensional. 
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EXAMPLE Al .3 


Let us illustrate the multiplication of a 3 X 2 matrix A by a two-component 
vector x. In the general case, we have 



a \\ 

a \2 


1 I 

£ X 
1 1 


Q\\X\ "i" ®\2 X 2 

II 

> 

X 

II 

a 2\ 

a 22 


= 

^21-^1 + a 22 x 2 


_ a 31 

a 32. 



. a 2>\ x \ + a 32 x 2. 


Symmetric Matrices and Their Properties 

Let us suppose that A is an n X n matrix. Such a matrix is called a square 
matrix since it has the same number of rows and columns. Let us note that a 
square matrix can be symmetric. This property applies when 


A' = A 


(Al .15) 


A square matrix A whose elements not on the main diagonal are zero is called 
a diagonal matrix. The general form of a diagonal matrix is 


a \\ 

0 


0 

a 22 


0 

o 


o o 


(Al. 16) 


The unity or identity matrix denoted as I is a special case of the diagonal matrix 


(Al. 17) 


in (Al. 16) such that a u = 1, for i 

~ 

1, 2, ..., 

n. 


'1 

0 ••• 

0' 

I = 

0 

1 ••• 

0 


0 

0 ••• 

1 


The determinant of a square matrix is a scalar and is denoted by 


detA = 


a n 

a \2 ' 

•• au 


a 2\ 

a 22 ' 

‘ ‘ a 2n 

(Al. 18) 

®n\ 

a n2 ' 

d nn 



(It is assumed that the reader is familiar with the computation of determinants.) 
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Norms, Products, and 
Orthogonality of Vectors 


The Euclidean norm, or length, of an n-tuple vector x is denoted as ||x|| 
and defined 

||x|| = (x'x) ,/2 (Al. 19) 


Note that the Euclidean norm as in (Al. 19) can be computed by using the 
concept of matrix multiplication. Indeed, for an n-component vector x we have 
from (Al. 19) 


x = 


[*1 *2 ••• *„] 


_ 

\ 

1/2 

X, 

*2 

_ / 

; 


v 

1 

* 

) 

V 


Ixf 

1=1 


(Al .20) 


This result is in agreement with the principles of matrix multiplication rules 
(Al. 11) and (Al. 12) for multiplication of a 1 X n matrix by an n X 1 matrix. 

The scalar ( inner ; dot) product of two n-component vectors x and y is the 
scalar (number) defined as 


n 

x'y = + X2 y 2 + ...+ x n y n = £ x,y, (Al .21) 

1=1 

Note that both vectors x and y are column vectors and that their scalar product 
is computed as a sum of n products of corresponding components. 

An important property as below holds for the scalar product that links the 
Euclidean norms of vectors x and y and the angle between them: 


x'y = IMI • llyll cos ^ 


(Al .22) 


This property is illustrated in Figure Al for the two-dimensional case of vectors 
x and y, where 


x = 



yi 

J2. 


(Al .23) 


The reader should note that the scalar product operation on two vectors is com- 
mutative. We thus have x'y = y'x. In the text the scalar product of input and 
weight vectors is often denoted by net , which is the argument of the activation 
function. 

Vectors x and y are said to be orthogonal if and only if their scalar product 
is zero. There is a simple connection between linear independence and orthog- 
onality. If the nonzero vectors x 1? x 2 , ..., x„ are mutually orthogonal (every 
vector is orthogonal to each other), then they are linearly independent. The three 
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Figure Al Illustration for the scalar product of two vectors, x and y. 


vectors of Example Al.l 



' 1 ' 


' 0 ' 


" 0 " 

X 1 = 

0 

0 

. x 2 = 

1 

. 0 . 

. x 3 = 

0 

_ 1 _ 


are examples of three mutually orthogonal vectors. They also are linearly inde- 
pendent. 


EXAMPLE Al .4 


Let us find the scalar product of two vectors x and y, their lengths, and the 
angle between them. The vectors are 



’2" 


r 

X = 

1 

2 

. y = 

-i 

4 


The definition (A1.21) yields the following scalar product value 
x'y = 2- 1 + 1 -(-1) + 2-4 = 9 

The angle between the vectors can be found from expression (Al. 22) as 
follows 
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Since the lengths of the vectors x and y computed from (Al. 20) result as 
||x|| = \/2 2 + 1 + 2 2 = 3 
||y || = %/l + 1 + 4 2 = \/l8 

we obtain from (Al. 24) 

, 9 1 
C0S * = TM = ~h 

3vlo V 2 

We thus conclude that ifs = 45°. ■ 


The øwter (vector, cross) product of two /i-component vectors x and y is 
defined as 

x X y = xy' (Al .25) 


Note that the outer product is an n X n matrix as shown below: 


V 



x iyi ■ 

•• x \y n 1 

x 2 

bi yi ••• y„] = 

x iy i 

x iyi ■ 

c 

^ . . . 

x„. 


. x n y\ 

x n yi ■ 

•• x n y n J 


(Al .26) 


Note that the outer product of two identical /i-tuple vectors results in symmetrical 
nXn matrix. 


Linear Systems, Inverse and 
Pseudoinverse Matrices 

Consider a linear system of n equations with m unknowns x x , x 2 , . . . , x m . 
The general form of the equation system is as in (Al. 13) 

y = Ax (Al .27) 

In a case where m = n and det A + 0, the exact solution of (Al. 27) is 

x = A~ l y (Al .28) 

where A -1 is the inverse matrix. In a case where m < n, equation (Al. 27) does 
not have an accurate solution for x, We may therefore be interested in obtaining 
the approximate solution, x, such that it minimizes the error ||Ax — y||. This 
solution is 

x = A + y (Al .29) 

In the case when m > n, equation (Al. 27) has an infinite number of solutions. 
Again, we may be interested in the solution, x, which has the smallest norm, 
i.e., ||x|| < ||x|| for each x such that Ax = y. This solution is 

X = A + y 


(Al .30) 
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In both cases (Al. 29) and (Al. 30) matrix A + is called the pseudoinverse of 
matrix A. 

When ra<nwe have the pseudoinverse matrix of value 

A + = (A'A)-’A' (Al .31) 

When m> n we have the pseudoinverse matrix of value 

A + = A'(AA') - 1 (Al .32) 
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QUADRATIC FORMS AND 
DEFINITE MATRICES 


Definition of Quadratic Forms 


Quadratic form £(x) of a vectorial variable, x, is of fundamental importance 
for studying multi-input/multi-output physical systems. The quadratic form is 
a scalar-valued function of x defined through a symmetric n X n matrix A as 
defined in expression (Al. 7). The quadratic form is equal 


which is equivalent to 

E(x) = [ jcj x 2 . . . 


E(x) = x'Ax 



1 

to 

S* 

1 


W 

X„] 

a 2X a 2 2 ■ 

• a 2n 


x 2 


_a„ x a n 2 

® nn - 


1 

1 


(A2.1) 


(A2.2) 


Assuming the symmetric matrix A and performing the multiplication from left 
to right of matrices 1 X n by n X n by n X 1 yields the scalar function 1X1, 
which is the polynomial in the form 


E(x x ,x 2 , ...x„) = a n xf + 2a l2 x x x 2 + 2 a x3 x x x 3 + . . . + 2a Xn x x x n 

+ a 22 xl + 2a 23 x 2 x 3 + . . . + 2a 2n x 2 x n + 1- a nn x^ 

Note that the assumption about the symmetry of matrix A causes all terms with- 
out the squared variables x h for i = 1, 2, . . . , n, to be doubled in (A2.3) since 
a V] — ciy r If the matrix A were not symmetric, we would have terms (a^ + a^pCj 
for all terms of (A2.3) without the squared variable. In short, the quadratic form 
of (A2.2) or (A2.3) can be rewritten as 

E(X) = X X a ij x i x j 


(A2.4) 


A2 QUADRATIC FORMS AND DEFINITE MATRICES 


All 


EXAMPLE A2.1 


Let us compute the quadratic form E(x) defined as in (A2.1), where 

, A = 


x = 


X\ 

yx 2 


1 2 
2 3 


Using either of the formulas (A2.2), (A2.3), or (A2.4) yields the quadratic 
form as 

E(x) = x\ + 4x t x 2 + 3x^ mm 


Properties of Quadratic Forms 

Let us define the most important properties of quadratic forms related to 
their value in terms of the vectorial variable, x. If for all nonzero x (x =£ 0) 
E(x) = x r Ax is 


(a) non-negative, then E(x) is called a positive semidefinite form and A is 
called a positive semidefinite matrix. 

(b) positive, then E(x) is called a positive definite form and A is called a 
positive definite matrix. 

(c) non-positive, then E(x) is called a negative semidefinite form and A is 
called a negative semidefinite matrix. 

(d) negative, then £(x) is called a negative definite form and A is called a 
negative definite matrix. 


Below we outline the procedure for the testing of quadratic forms for conditions 
(a) through (d). We begin with case (b). 

The form E(x) is said to be positive definite if and only if, for the symmetric 
matrix A 

T a n a n * * * a \ n] 


A = 


a n a 22 


a 2n 


(A2.5) 


i a ln ^2 n ' ' ’ & nn J 


its principal determinants det A u , det A 22 , . . . , det A nn , defined as below fulfill 
the conditions 

detA n = |a n | > 0 


dety4 22 = 


a i\ a n 
a \2 a 22 


> 0 


(A2.6) 


detA„„ = det A > 0 
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The test conditions are tabulated below for all relevant cases. Similar to (A2.6), 
the test involves evaluation of the signs of the principal determinants of matrix A. 


Case 

Class 

Test 


(a) 

Positive semidefinite 

detA,, ^ 0 , d etA 22 — 0 , . 

. , det A nn > 0 

(b) 

Positive definite 

detA n > 0, detA 22 > 0, . 

. , det A nn > 0 

(c) 

Negative semidefinite 

detA n < 0, detA 22 ^ 0 . . 

. etc. 



(note altemating signs) 

(A2.7) 

(d) 

Negative definite 

detA n < 0, detA 22 > 0, . 

. . , etc. 



(note altemating signs) 


(e) 

Indefinite 

none of the above. 



Note that the quadratic form E(x) is said to be indefinite if it is positive for some 
x and negative for other x. This points out that the sign of an indefinite quadratic 
form depends on x. 


EXAMPLE A2.2 


Let us look at an example of a quadratic form with a symmetric 2X2 
matrix A such that its diagonal elements are zero: 


x'Ax = [xj 


0 a 


’xi' 

a 0 


, x 2. 


Since we have 


detA H = 0, detA 22 = — a 2 < 0 

we can conclude that the evaluated quadratic form is indefinite. Indeed, 
multiplication of matrices or using expansion of the evaluated form as in 
(A2.3) yields the quadratic form 


E(x) = [ax 2 



= 2ax x x 2 


The sign of the resulting function E(x) depends on x. If x x and x 2 are of 
identical signs, E(x) remains positive, otherwise it is negative. Therefore, 
vectors x from the first and third quadrant render the discussed form positive, 
otherwise it is negative. wm 
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TIME-VARYING AND GRADIENT 
VECTORS, JACOBIAN AND 
HESSIAN MATRICES 


Time-Varying Vectors 

A time-varying vector x(t) is defined as a column vector (Al.l), but its 
components are themselves functions of time, i.e., 

'x i(0" 


x(0 = 


X 2 (0 


X n (t) 


The time derivative of the vector x(f) is defined as 

dx x (t) 

dx(t) 
dt 



or in a shorthand notation as 


x(0 = 


~ X\ (t) 

x 2 (t) 


(A3.1) 


(A3.2a) 


(A3.2b) 


U« (0. 


Gradient Vectors 


Consider a scalar function £(x) of a vectorial variable x, which is defined 
as an n-component column vector as in (Al.l). The gradient vector of E(x) with 
respect to the column vector x is denoted as V x E(x) and is equal 

MEl 


V x E(x) = 


rlv, 

3 £ 

dx 2 


(A3.3) 


a£ 

dx n 


Note that the gradient is also an n-tuple column vector. 
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Figure A2 Illustration for the gradient vectors. 


Notice that the immediate consequences of the definition of the gradient 
vector as in (A3.3) are 

V,(x'y) = y (A3.4a) 

V,(x'Ay) = Ay (A3.4b) 

V^x'Ax) = Ax + A'x (A3.4c) 

When dealing with quadratic forms and when A is a symmetric matrix, we have 
A = A! and (A3.4c) simplifies to 

V*(x'Ax) = 2Ax (A3.5) 

The geometrical interpretation of a gradient is often useful. Figure A2 shows 
a function E(x) with x restricted to two components, x Y and x 2 . Vector \ x E(x) at 
a given point x 0 specifies the direction of the maximum increase of E(x). Note 
that at the maximum point, x*, we have 

V*£(x*) = 0 (A3.6) 

The property of the zero gradient also holds at a minimum point. 


Gradient of Time-Varying Vectors 

Assume that x is a time-varying n-component vector and that a scalar func- 
tion E[x(t )] (not necessarily a quadratic form) is defined for such a vectorial 
variable. The time derivative dE[x(t)]/dt for the evaluated scalar function can 
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be computed by the chain rule as follows 


dE[x(t)\ = d£ dx } (Q + ø£ _ dx 2 (t) + + d£ _ dx n (t) ^ 

dt dx x dt dx 2 dt " ' dx n dt 


or, briefly 


dE[x(t)] n dE . /A 

— T — = I T~ x ‘ (r) 

dt i=i dXj 


(A3.7b) 


We may notice from (A3.3) that the derivative (A3.7) can be expressed succinctly 
using the gradient vector as follows: 

dE[x(t)] 


dt 


= [V t £(x)] r x (0 


(A3.8) 


Here we have used the notation of (A3.2b) for brevity. 

EXAMPLE A3.1 

Consider that the following quadratic form with the matrix A as in Example 
A2.1 is to be evaluated for changes in time: 

E[x(t)] = {e 2t t] 

Notice that in our example we have 

x(0 = 


1 2 


(N 

2 3 


* J 


’*i (t) 



x 2 (t) 




Using expression (A3.8) allows the result to be rewritten as a scalar product 
of the gradient vector 

' d_ 
dx x 
d_ 

dx 2 

times the vector x (t) of value 

’ 2e 2t 1 


V x E(x) = 


-{x\ + 4 x x x 2 + 3xf) 

i 

-(x 2 + 4jcj jc 2 + 3x%) 


2jcj + 4x 2 
[4x t + 6x 2 \ 


x(t) = 

Combining the two results we obtain 

dE[x(t)] 


1 


dt 


= [le 2t + 4 1 4e 2t + 6 1 ] 

= 2[2e 4 ' + (4 1 + 2)e 2 ' + 3 1] 


2e 2 ' 

1 
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Jacobian Matrix 

Let us consider m functions £,(x), for i = 1, 2, m, of the vectorial 
variable x defined as 

£,(x) = E x (x x ,x 2 ,... x n ) 

E 2 (x) = E 2 (x x ,x 2 ,...x n ) 

(A3.9) 


E m (x) = E m (x x ,x 2 ,...,x n ) 

The matrix Æ/dx called the Jacobian matrix and denoted J(x) can be defined 
for (A3.9) as: 


J(x) = 


' dE x 

dE x 

dEi 

dx x 

dx 2 

dx n 

dEi 

dE 2 

d h 

dx x 

dx 2 

dx n 

dE m 

d Iin 

dE m 

dx x 

dx 2 



(A3.10) 


EXAMPLE A3.2 


Let us compute the Jacobian matrix at x = [ 2 1 | ! for the following 

functions of f^x) and E 2 (x): 

E x (x) = - x x x 2 

E 2 (x) = x x x 2 + x\ 

The Jacobian matrix expressed from (A3.10) is 

J(x) = 


$ 

1 

£ 

-■*1 


‘3 

-2' 

L *2 

x x + 2x 2 _ 


1 

4 


Hessian Matrix 

For a scalar-valued function E(x), a matrix of second derivatives called the 
Hessian matrix is defined as follows: 

V 2 £(x) = V, [V x E(x)\ 


(A3.11) 
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d 2 E 

d 2 E 

d 2 E 


dx x dx 2 

dx x dx n 

d 2 E 

d 2 E 

d 2 E 

dx 2 dx l 


dx 2 dx n 

d 2 E 

d 2 E 

d 2 E 

dx n dx { 

dx n dx 2 

dx 2 


(A3.12) 


Note that the Hessian matrix is of size n X n and is symmetric. The matrix is 
often denoted by H. 


EXAMPLE A3.3 

Let us compute the Hessian matrix and determine whether it is positive 
definite for the function 

E(x) = (x 2 - Xj) 2 + (1 - xfj 2 
Computing first the gradient vector yields 

V x E(x) = 2 “ 1 

L x 2 - x \ J 

and the Hessian matrix of size 2X2 becomes 

V 2 £(x) = 2 

Since the principal determinants of the matrix are 4 and 2, the Hessian 
matrix is positive definite. Note that the Hessian matrix is constant in this 
example and independent of x. ■■ 
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SOLUTION OF 
OPTIMIZATION PROBLEMS 

Unconstrained Optimization Problem 

An optimization problem is usually defined in terms of the minimization of 
a scalar function of a number of variables. If the variables are not constrained by 
inequalities or equality relationships, the optimization is said to be unconstrained. 
The function to be minimized is an objective function, sometimes also called 
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a cost function or criterion function. In neural network studies, the objective 
func tions take the forms that folio w: 


1. For feedforward networks: The scalar error function £(w) in the weight 
space is the objective function. 

2. For recurrent networks: The scalar energy function E(\) in the network 
output space is the objective function. 


For a scalar valued objective function E(x) of a single variable x, the well-known 
conditions for a minimum at x = x* are 


dE{x ) 
dx 

d 2 E(x*) 
dx 2 


= 0 
>0 


(Å4.1) 


For a scalar valued function E(\) of a vectorial variable x we generalize condi- 
tions (A4.1) as given below: 

V x E(x*) = 0 (A4.2a) 

V 2 E(x*) is positive definite (A4.2b) 

Condition (A4.2a) requires the gradient vector at a minimum of £(x) to be a 
nuil vector. Condition (A4.2b) requires the Hessian matrix at a minimum of E(x) 
to be positive definite. The evaluation of conditions (A4.2) is possible only for 
continuous functions involved in expressions for E(x ), and in matrices V x E(x) 
and V 2 E(x) for all values of x. 

To derive conditions (A4.2), let us inspect a two-variable function £(x) 
expanded in a Taylor series at x = x* and retain the linear and quadratic term 
of the expansion: 


EY \ ~ EY *\ 4 .( *,d£(X*) , ( */ Æ ( X *) 

£(X) — £(X*) + (JCj - — + (x 2 - x 2 *)- 


dx 


&X 2 

^2i 




dX2 


(A4.3a) 


+ (xj - x t *)(x 2 - x 2 *) 


d 2 E(x*) 

()X i f)X 2 


Equation (A4.3a) rewritten in the matrix form becomes for Axj = x, — x,* and 

Ax 2 = x 2 - x 2 *: 


£(x) = £(x*) + 


1 

+ 2 


' dE(x*) 

d£(x*) ' 

Ajtj 

dxi 

9x 2 . 

m åx 2m 



" d 2 E(x*) d 2 £(x*) " 


Axj 

t 

dx 2 dXi dx 2 


‘AV 

Mi. 


d 2 E(x*) d 2 E(x*) 


_ Ax 2 . 


dx 2 dx 1 dx 2 



(A4.3b) 
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Using the gradient vector (A3.3), Hessian matrix (A3.12), and the notation 

A 


Ax 


Ajcj 


we obtain the following matrix form of (A4.3b): 

E(x) = E(x*) + [V^£(x *)] 1 Ax 4- ^Ax ? V^E(x*) 


Ax 


(A4.3c) 


If x* is to be a minimum of E(x) as in (A4.3c), any infinitely small change Ax 
at x* should result in £(x* + Ax) > £(x*). This requires the following: 


1. The gradient vector at x* vanishes and makes the linear term in (A4.3c) of 
zero value, and 

2. The Hessian matrix at x* is positive definite, which makes the last term in 
the quadratic form (A4.3c) positive independently of Ax. 


Analyticai Solution 
of a Selected 
Optimization Problem 

Suppose that 

E(x) = “x'Ax + b'x (A4.4) 

where x is an ^-dimensional column vector; A is an n X n symmetric, positive 
definite matrix; and b is an ^-dimensional constant vector. 

The scalar multivariable function E(x) as defined in (A4.4) is one of the 
few functions for which an analyticai solution for the unconstrained minimum 
exists. Computing the gradient vector and the Hessian matrix of E(x) yields: 

V x E(x) = Ax + b (A4.5a) 

V^E(x) = A (A4.5b) 

Setting condition (A4.2a) for the minimum at x*, we obtain the equation for x*: 

Ax* + b — 0 (A4.6a) 

which has the solution 

x* = -A _1 b (A4.6b) 

Note that the Hessian matrix as in (A4.5b) has been assumed to be constant and 
positive definite; therefore, the solution found at x* in (A4.6b) is a minimum. 
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EXAMPLE A4.1 


Let us evaluate the location of the minimum of the following function having 
the form of Equation (A4.4): 


£(x) = - [*, 


Xl. 


+ [1 0 ] 


*1 

x 2 \ 


This quadratic form, supplemented by the linear term, is equivalent to the 
expanded polynominal of two variables: 

£(x) = + 2x x x 2 + 4x2 + x i 


Equation (A4.5a) takes the form: 



Solving it yields vector x = x* such that 


J2~ 

3 

1 

. 6 . 

Checking the principal determinants of the Hessian matrix, A, we obtain 


*!* 

1 

4 

-1 


1 

*2*. 

6 

-1 

1 


0 


detA u = 2 > 0 

detA 22 = 2*8-4 = 12 >0 

Therefore, because the Hessian matrix is positive definite, the point x* is 
the minimum of E(\). 

Note that selecting an arbitrary Ax value yields a vanishing linear term 
and the quadratic term of the series expansion (A4.3c) is equal: 


Axj 


Ax 2 


1 f 


Axj 

1 4 


Mi. 


= (Ax^ 2 + 2 Ax x Ax 2 + 4(Ax 2 ) 2 


This result is equivalent to the expression 

(Axj 4- Ax 2 ) 2 + 3(Ax 2 ) 2 > 0 

which is always positive if Ax x and Ax 2 are not both equal to zero. 


Geometrical Interpretation 

Note that the gradient vector nullification at x = x* as in (A4.2a) is nec- 
essary for the existence of the minimum at x*. The same property of gradient 
vector nullification, however, is also shared by maxima and saddle points. This 
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is illustrated for a function E(x { , x 2 ) of two variables in Figure A3. The figure 
shows the shapes of E(x l9 x 2 ) for the case of a minimum (a), and of a saddle (b). 

Assume that the functions represented by the surfaces shown are quadratic 
forms with the linear term in the form as in (A4.4). Matrix A must be positive 
definite for the function shown in Figure A3(a). The figure illustrates conditions 
described in Example A4.1 with the minimum at x = x* = [—2/3 l/6] f . Fig- 

ure A3(b) displays contour maps showing constant value lines for the function 
from Figure A3(a). 

A quadratic form is, however, indefinite for the function depicted in Figure 
A3(c) because it can take either sign depending on Ax. The function itself is 
shaped like a saddle. The function increases in one direction Ax, but decreases 
in the other and the form is said to be indefinite at the saddle point. Also, matrix 
A is indefinite at the saddle. 

The reader may verify that if A is chosen in Example A4.1 as 



2 " 

~8 


matrix A becomes indefinite based on the test from Section A2.2 since 


detA n = 2 > 0 
detA 22 = -20 < 0 

The saddle point is the solution of the equation 


which is 


'2 

2" 


V 

+ 

r 

.2 

-8. 


.* 2 . 

. 0 . 



0 


This case of a two-variable function is depicted in Figure A3(c). Figure A3(d) 
displays contour maps for the function from Figure A3(c) that possesses a saddle. 


A5 — 

STABILITY OF NONLINEAR 
DYNAMICAL SYSTEMS 


Preliminary Considerations 

Late in the nineteenth century, the Russian mathematician A. M. Lyapunov 
developed an approach to stability analysis of dynamical systems. The unique 
feature of the approach is that only the form of the differential or difference 
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Contour map of constant E 

EC xl f x2) 



Figure A3 Geometrical interpretation of minimum and saddie point: (a) function from Example 
A4.1 with minimum, (b) contour map for (a), (c) function with saddie point, and (d) contour map 
for (c). 


equations needs to be known, not their solutions. The method, called Lyapunov’s 
method , is based on the generalized energy concept. The method requires eval- 
uation of a so-called Lyapunov function. Consider first an analogy between 
Lyapunov’s function and the energy function of a physical system. 




Figure A4(a) shows an autonomous (i.e., unforced) second-order electric 
system consisting of a series RLC connection. In this case, capacitor voltage v 
and inductor current i can be defined as State variables, and x is the State vector 
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The total physical energy stored in the system is known to be 

E(x) = l -Cv 2 + l -Li 2 = i [v 

Thus the energy function is a quadratic function of State variables and 

E(x) > 0 


C 

0' 

V 

0 

L 

i 


(A5.2) 

(A5.3) 


unless x = 0. We can, therefore, say that the energy function is positive definite. 
When the energy function rate of change in time is evaluated we notice that 


x(0 = 

V(t) 

. i(t) . 

(A5.4a) 

and from (A3.7a) we obtain 



dE(x) dE(x) 

. 3£(x) . 

(A5.4b) 

dt dx Y 


which is equivalent to 



d T - [v - £ wi' 4 

(A5.4C) 

Using (A5.2) the derivative of the energy function versus time becomes 


dE(x) p ^ -] 

^ [cv u] 

Y = Cv v +L/ i 

i 

(A5.5) 


Knowing the form of the differential equations describing system x x — /j(x), 
i 2 = / 2 ( x ) is usually sufficient for evaluation of the expressions for £(x) and 
dE(x)/dt. For our sample Circuit from Figure A4(a) the equation describing the 
system is 


L di(t) 

dt 



+ Ri(t) = 0 


(A5.6a) 


This integrodifferential equation can be rewritten as a system of two first-order 
equations using the State vector as defined in (A5.1): 

dv{t) _ i(t) 

~dt~ “ ~C 
di(t) _ _ v(t) Ri(t ) 

dt L L 


(A5.6b) 


which is equivalent to 




i 


r • 


0 


_ _ 

V 


c 


V 

i 


1 R 


i 



1 

1 



(A5.6C) 


Now we can evaluate the sign of the rate of change of £[x(r)] versus time 
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using (A5.5): 

= i(t)v(t) - i(t)v(t) - Ri\t) (A5.7a) 

which reduces to 

= —Ri 2 (ø < o (A5.7b) 

at 

The discussed energy function always decreases in time unless i — 0. The result 
(A5.7b) can be easily interpreted physically. The Circuit of Figure A4(a) repre- 
sents a lossy resonant Circuit. Once the circuit is energized with v(0) ^ 0 or 
/(O) + 0, it starts producing damped oscillations as illustrated in Figure A4(b). 
The damping phenomena are a result of the nonzero resistance value in the 
circuit, which makes the system actually dissipative. 

The evaluated solution for v(r) and i(t) can also be represented as a trajec- 
tory v(/), with the time variable eliminated. The damped oscillations from Figure 
A4(b) take the form of a State trajectory as in Figure A4(c), which is drawn to the 
origin after a sufficiently long time has elapsed. Again, neither the time-domain 
solution nor knowledge of the trajectories is required to prove that the system is 
stable. Evaluation of the energy function alone allows the stability of the system 
to be proven or disproven. 


Lyapunov’s Theorem 


Consider the autonomous (i.e., unforced) system described with a system of 
n first-order linear or nonlinear differential equations: 

=/i(x) 

*2 = A( x ) 

(A5.8a) 


which is equivalent to 


K =/«( x ) 

x= f(x) 


(A5.8b) 


We assume that the equations have been written so that x = 0 is an equilibrium 
point that satisfies equations f(0) = 0. We formulate a condition for the equilib- 
rium x = 0 to be asymptotically stable which means that the State vector goes to 
zero as time goes to infinity. The argument goes like this: If a positive definite 
function E(x) can be found such that 


1. E is continuous with respect to all components x t ,for i = 1, 2, ... , n , and 

2. dE[x(t)\ / dt < 0, which indicates the energy function decreasing in time, 
then the origin is asymptotically stable. 
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Figure A5 Illustration of the Lyapunov theorem: (a) stable trajectory, (b) unstable trajectory, and 
(c) respective positions of velocity and gradient vectors. 


Function E(x) satisfying the above requirements is called a Lyapunov function. 
The function is not unique; rather, many different Lyapunov functions can be 
found for a given system. If at least one function is known that meets conditions 
1 and 2 given above, the system is asymptotically stable. This means that its 
attractor is a point rather than a bounded trajectory. Figure A5(a) illustrates 
asymptotical stability, which we simply call stability in our neural network stud- 
ies. This trajectory is of the same type as that shown in Figure A4(c) representing 
the decaying oscillations from Figure A4(b). Figure A5(b) shows an unstable tra- 
jectory of a system with an increasing energy function such that dE[x(t)] /dt > 0. 
This case would correspond to the negative resistance value in (A5.7b) and to 
the energy function increasing in time. 
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Note that modifying condition 2 so as not to have strictly negative but non- 
positive derivatives of the energy function would allow for so-called limit cycles, 
which correspond to limited but nondamped and steady oscillations. Indeed, for 
the example Circuit from Figure A4(a) with R = 0 we obtain from (A5.7b): 


dE(x) 

dt 


(A5.9) 


In this case the energy function remains constant in time. This corresponds to the 
lossless LC resonant circuit, which produces nondecaying sinusoidal oscillations. 
Since the circuit does not dissipate its energy in the resistance, the oscillations are 
of constant amplitude. This case corresponds to a circular trajectory on the plane 
x lf x 2 rather than decreasing (stable) or increasing (unstable) spiral trajectories 
on the State plane. 

Figure A5(c) provides an interesting observation relating the energy function 
E(x) and the trajectories of the stable system. Let us note that based on (A3.8) 
condition 2 is equivalent to 

= rv c £(x)]' X (o (A5.10) 


which is the scalar product of the energy function gradient vector and of the 
velocity of the State vector x (£). The velocity vector is obviously tangent to 
the trajectory. The gradient vector is normal to the energy contour line. These 
conditions are illustrated in Figure A5(c). Since stability condition 2 requires the 
scalar product (A5.10) to be negative, the angle between the two vectors 


= <[i(0,V ;t £(x)] (A5.11) 

must be bounded as follows 

-180° < ijj< -90° (A5.12) 

This, in tum, means that the State velocity vector x (r) must have a negative pro- 
jection on the gradient vector, V K E(x). This also indicates that the State velocity 
vector x ( t ) has a positive projection on the negative gradient vector, -V r £(x). 

As a final note, the energy function having physical energy meaning often 
fits as a Lyapunov function. In many cases, however, in which a system model 
is described with differential equations, it may not be clear what “energy” of the 
system means. Therefore, the conditions that E(x) must satisfy to be a Lyapunov 
function are based on mathematical rather than physical considerations. For our 
neural network studies of fully coupled single-layer networks, the Lyapunov func- 
tion is often called the computational energy function and has no physical energy 
relation. Note also that the inability to find a satisfactory Lyapunov function does 
not mean that the evaluated system is unstable. 

No unique and best method exists to identify a Lyapunov function for a 
given system. A form E(x) can be assumed either as a pure guess or obtained 
by physical insight and energy-like considerations. Following the choice of the 
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hypothetical energy function which is positive definite, dE(x)/dt needs to be 
tested involving the equations f(x) =x in the process. 

EXAMPLE A5.1 


Let us check the stability of the closed feedback loop control system shown 
in Figure A6. The system is described by the equation 

U(S) - = Q(s) 

$(s + a) 


where a and K are positive constants. To evaluate the stability of the closed- 
loop system we use the equation for an autonomous system and assume 
U(s) = 0. This corresponds to the relationship 


s 2 Q(s) + asQ(s) + KQ{s) = 0 


In the time domain, the autonomous system is described with the equation 

q +a q +Kq = 0 

The following definitions of State variables can be introduced in order to 
identify the system of first-order differential equations: 

A 

*1 =q 

A . 

*2 =4 

Since x 2 =x u we have x 2 = q , and Equation (A5.8) can be written as follows: 

x x = 0x x + lx 2 
x 2 = —Kx x — ax 2 


We thus have the x= f(x) relationship as given below: 


x= 

0 

r 

x, where x = 



-K 

— a 


x 2 


Let our first guess for the energy function E(x) be 

£(x) = xf + 



Figure A6 Illustration for Lyapunov function evaluation in Example A5.1. 
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Since the function is positive definite and fulfills condition 1, checking 
condition 2 yields: 


dE\m\ 

dt 


r d*i i 

d£(x) dE(x ) ' 
dX\ dx 2 

L dt J 


dt 
dx 2 


2xj Xj +2x 2 x 2 


Using the equations describing the system, the rate of change of the energy 
function in time becomes 


Æ[ * (f)] = 2XjX 2 + 2 x 2 (-Kx x - ax 2 ) = 2x,x 2 (l - K) - 2 ax 2 

dt 

To check whether or not this result is always positive independent of x we 
can make it equivalent to the following quadratic form: 


2xjX 2 (1 — K) — 2ax^ = [x! 



Since detA u =0, the derivative of the energy function is not negative and 
the system cannot be asserted to be stable. In faet, it cannot be called an 
unstable system either. However, another choice of the Lyapunov function 
can be attempted. Assuming 

E(x) = Kx\ + (x 2 + ax,) 2 

we have selected a positive definite function satisfying condition 1. Checking 
condition 2 yields: 


dE[x(0] _ p x x K + 2(x 2 + ax x )a]x 2 + 2(x 2 + ax^—Kxi - ax 2 ) = -2 aKx\ 

dt 

This concludes the proof that the chosen energy function has a negative 
definite derivative versus time and therefore the system is stable. wm 


A6 i » 

ANALYTIC GEOMETRY 

IN EUCUDEAN SPACE IN 
CARTESIAN COORDINATES 

The point P 0 (x 0 ,y 0 ) in the two-dimensional Euclidean space in Cartesian 
coordinates can be represented as a two-component column vector x 0 defined as 
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Equation Forms of a Straight 
Line on a Plane 

For this section, refer to Figure A7. 

1. The gradient form: 

y - kx + q (A6.1) 

2. The intercept form: 

- + y - = \,p±0,q±0 (A6.2) 

p q 

where p and q are intercept coordinates with the x and y axes, respectively. 

3. The general form 

ax+by + c = 0, a 2 + b 1 > 0 (A6.3) 

where constants a and b are coordinates of a vector n: 

a 

n= [b\ 

which is normal to the line. The vector is called a normal vector and has 
the property that it points to the side of a line for which ax + by + c > 0, 
which is called the positive half-plane. This property is proven below. 


r 

Positive side >’o) 



Figure A7 Illustration of equation forms for a straight line. 
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Indeed, for a point [jc y] f on the line we have from (A6.3) 

+ c = 0 (A6.4) 


Assume that a vector Ax has been added to x so that point Q(x + Av, y + Av) 
is on the upper side of the line of Figure A7. Computing from (A6.4) 
yields 


x + Ax 

[y + AyJ 


+ c 


=(-'[;] 

i r Aci 

= n' . 

L^J 


+ c) + ri 


Ax 

LAyJ 


(A6.5) 


Since the vectors n and Ax = [ Av Ay ] ' are both nonzero and the angle 
between them is between —90° and 90°, their scalar product (A6.5) is 
always positive. This proves that n always points to the positive half-plane. 
Note that the positive half-plane is denoted by + signs, or by a small 
arrow pointing toward the positive side of the line. 

Note that this equation form can be used with the unit normal vector 
r rather than the normal vector. 


4. The two-point form: 


y-y i 

x — x x 


yzzii 

*2 - V 


or 


JC 

*\ 

x 2 


y i 
y\ 1 
i 


= o 


(A6.6a) 

(A6.6b) 


The line passes through points F 1 (jr,,y 1 ) and P 2 (x ’^yf) 

5. The normal vector-point form: 

n'(x - Xj) = 0, or (A6.7a) 

fi\(x X!> ■+■ n 2 (y yi) ~ 0 (A6.7b) 

The line is perpendicular to its normal vector n = [ n 2 ] ' and passes 
through point P x (x x ,y x ). This form of the equation remains valid also for 
the unit normal r used rather than the normal vector. Indeed, dividing both 
sides of (A6.7a) by (n\ + n\) 112 we obtain: 

Pjj(x “ *i) = 0 (A6.8) 

Using the definition of the unit normal vector 


a n 

r = r t — w 
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Figure A8 Illustration of a line bisecting the segment between x 1 and x 2 . 


we obtain an alternative normal vector-point form equation: 

r'(x - x,) = 0 (A6.9) 

6. The form for bisecting a segment between \ x and x 2 : 

(x, - x 2 )'x + i (||x 2 || 2 - lix, || 2 ) = 0 (A6.10) 

This line passes through the center of gravity of two points — point 
PiCxi,*) represented by the vector x l9 and point P 2 ( JC 2 * 3 ; 2 ) represented by 
the vector x 2 — and is perpendicular to the line’ s normal vector x x — x 2 . 
The coordinates of the center of gravity point P 0 are 0.5(x 1 + x 2 ). This line 
is illustrated in Figure A8. 

The derivation below of Equation (A6.10) is based on the Equation (A6.7a). 
Indeed, the line of interest is normal to the vector x x — x 2 and passes through 
the center point P 0 . We thus have from (A6.7a) 

(xj - x 2 /(x — Xq) = 0 (A6.11a) 

which is equivalent to 

(x, - x 2 )'x - ~(x, - x 2 )'(x 1 + x 2 ) = 0 (A6.11b) 

Performing multiplication and reducing similar terms we obtain 

(x x - x 2 )'x + i(x'x 2 - x'xj) = 0 (A6.11C) 

Using the identity x'x = ||x|| 2 we notice that (A6.11c) becomes identical to 
(A6.10). 
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Useful Relationships 


For this section, refer to Figure A7. The distance g from the line ax + by + 
c = 0 to the point FoC^O'-Vo) i s 


ax o + by 0 + c 
Va 2 + b 2 


(A6.12) 


For special cases when the point P 0 is the origin, the distance from the origin 
to the line equals 


8 = 


c 

\fcP- + F 


(A6.13) 


Note that the distance is always a non-negative quantity by definition. 


EXAMPLE A6.1 


This example reviews the main forms of equations of a straight line and 
illustrates useful relationships. Let us begin by finding the gradient form of 
an equation for the line shown in Figure A9. Since 

k = tan i/r = q = — 3 



Figure A9 Illustration for Example A6.1. 
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then Equation (A6.1) has the form 


y — 2 X ~ ^ 


Since p — 2 and q — —3, the intercept form of equation is 

x 
2 

The general form of the equation can be obtained by rearranging either of 
the two forms above to the following form: 


§- 


3x — 2y — 6 = 0 

The resulting normal vector perpendicular to the line and pointing to the 
positive side of the line is [3 —2] / 

The unit normal vector r is 


r — 


1 

7l3 


[ 4 ] 


V3 2 + 2 2 

Note that the length of the normal vector remains irrelevant for this form 
of line equation. Indeed, [3/2 — 1 ] f is another normal vector to the line. 

The line’s general equation for this normal vector is obtained by dividing 
the original line equation in the general form by 2 as given below: 

\x — y - 3 = 0 
2 

The distance from the line to the origin is 

6 


8 = 


IV3 2 + 2 2 | v^l3 
The two-point form equation (A6.6) for the line passing through points (2, 0) 
and (0, -3) is 

y = zl 

x-2 -2 

Knowing the line normal vector and a single point P on the line, the equation 
as in (A6.7) can be obtained: 


3 

-2 


'M" 


or 


3x — 2y — 6 = 0 

This concludes the review of the example forms of straight line equations.i 


A6 ANALYT1C GEOMETRY IN EUCUDEAN SPACE IN CARTESIAN COORDINATES 


A35 


Equation Forms of a Plane in 
Three-dimensional Space 


1. The intercept form: 

- + - + - = 1, p + 0, q + 0, s # 0 (A6.14) 

p q s 

where p, q and s, are plane intercept coordinates with the x, y and z axes, 
respectively. 

2. The general form: 

ax + by + cz + d = 0, a 2 + b 2 + c 2 > 0 (A6.15) 

The constants a, b and c determine the normal vector n 


n - 


a 

b 

c 


which points to the positive side of the plane, usually denoted by an arrow. 
Equation (A6.15) can be rewritten in vectorial notation: 


+ d = 0 


(A6.16) 


3. The three-point form: 


X 

y 

z 

1 

*1 

y i 

Zl 

1 

*2 

?2 

z 2 

1 

*3 

3*3 

z 3 

1 


= 0 


(A6.17) 


The plane passes through three non-linear points f > 1 (x 1 ,y 1 ,z l ), P 2 {x 1 ,y 2 ,z 2 ), 
and P 3 (x 3 ,y 3 ,z 3 ). 

4. The normal vector-point form: 

a(x — jtj) + b(y — y!> + c(z — z t ) = 0 
or 

n'(x — Xj) = 0 

where 


(A6.18a) 

(A6.18b) 


X = 

1 

H ^ 

i 

, and x x = 

X\ 
y i 


L Z J 


- z i. 


The plane is perpendicular to the normal vector n and passes through point 
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P\{x\*y\*z\)- Th* s form or equation also remains valid for the unit normal 
vector n used rather than normal vector. 

5. Th z form for bisecting a segment between X! and x 2 : 

(X! - x 2 )'x + i (||x 2 || 2 - ||x 1 1| 2 ) = 0 (A6.19) 

This plane passes through the center point between two points — / > 1 (x 1 ,> , 1 ,z ) ) 
represented by the vector x, , and P 2 (x 2 ,y 2 ,z 2 ) represented by the vector 
x 2 — and is perpendicular to the plane’s normal vector X! - x 2 . The 
coordinates of the center point P 0 are 0.5(X! + x 2 ). 


Useful Relationships in Three-dimensional 
Euclidean Space 


1. Distance of point ^o^O’Jo^o) from die plane ax + by + cz + d = 0 

| ax 0 + by 0 + cz 0 + d 


8 


Va 2 + b 2 + c 2 

When the point P 0 is the origin, the distance (A6.20) reduces to 

d 


(A6.20) 


8 = 


\Za 2 + b 2 + c 2 


(A6.21) 


2. The distance between two points P^ix^yx^x) and / > 2 (x 2 ,y 2 ,z 2 ) is equal to 
the length of vector X! — x 2 : 


II*! - X 2 II = y/(x 1 - x 2 ) 2 + Oi - y 2 ) 2 + (z x ~ z 2 > 2 • (A6.22) 


where 


3. The mid-point of the line segment connecting points P\ and P 2 as in 
(A6.22) corresponds to a vector x m such that 


*1 

A 

*1 

yi 

- *2 = 

yi 

.*i. 


.22. 


Xi + Xo 


x m = 


(A6.23) 


4. The center of gravity of a system with K masses m k distributed at points 
x k is at x c such that 

K 

I m k x k 


k= 1 


K 

1 

*= 1 


x { 


(A6.24) 
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Note that the center P(x,y,z) of mass of a triangle is given by 


x> + x 2 + . 

X c = - J i.e., 

_ x l +x 2 +x 3 _ yi + y 2 + y 3 _ _ Zj + z 2 + z 3 
X 3 3 3 


EXAMPLE A6.2 


This example reviews some forms of equations for a plane and illustrates 
useful relationships. Let us find the equations of a plane perpendicular to the 
vector [2 3 6]' and passing through point (0,7,0). The normal vector- 

point equation (A6.18) for the plane can be obtained: 

2x + 3(y - 7) + 6z = 0, or 
2x + 3y + 6z — 21 =0 

To sketch the plane, it is convenient to rearrange the plane equation to the 
intercept form (A6.14) as follows: 

— - — + — H — = 1 

21/2 7 7/2 

The plane is shown in Figure A10. The unit normal vector for the plane is 

r 2i 



X 


Figure A10 Plane for Example A6.2. 
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Note that the distance between the origin and the plane is 


g = 


21 

V2 2 + 3 2 + 6 2 


= 3 


Since the normal vector points to the upper part of the plane, the upper 
part is the positive half-space; the lower part, including the origin, is the 
negative half-space. m 


SELECTED NEURAL 
NETWORK RELATED 
CONCEPTS 


Exclusive OR (XOR) Operation 

The Exclusive OR, denoted with operator ®, is a binary operation on logic 
arguments Xj and x 2 . It assigns the value 1 to the two arguments if and only if 
they have complementary values; that is, x, ©x 2 = 1 if either x, or x 2 is 1, but 
not when both x l and x 2 are 0, or when both x, and x 2 ~ 1. The truth table for 
Boolean variables 0, 1 is 


*1 

*2 

X x © x 2 

0 

0 

0 

0 

1 

1 

1 

0 

1 

1 

1 

0 


Let us note that the operation is associative since 

(Xj © x 2 ) © x 3 = X, © (x 2 ©x 3 ) = Xl ©x 2 ©x 3 

This allows the extension of the XOR operation for n logic arguments as 

XOR(xj,x 2 ,...,x„) = Xj © x 2 © . . . © x„ 

The multivariable XOR function is called the parity function since it assigns the 
value 1 to n arguments if and only if the number of the arguments of value 1 
is odd. 
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376 
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370-75 

Asynchronous stochastic recur- 
sion, 258-61 

Asynchronous updating, 257-61 
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Autoassociative memory chip, 
645, 647-48 

Autocorrelation matrix, 324, 328 
Autocorrelator, 324 
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system for, 228-29 
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8-10 
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Error back-propagation 
training 

Baird, H. S., 476 
Barto, A. G., 490, 499, 511, 513, 
548 

Basin of attraction, 253 
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76, 354-70 

architecture of, 355-57, 358 
capacity of, 363 
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357-59 
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improved coding with, 363-68 
spatio-temporal pattems and, 
370-75 

stability considerations with, 
359-60 

Bidirectionally stable memory, 
359-60 

Binary activation functions, 36 
bipolar, 33, 39 

Hebbian leaming rule and, 61-63 
perceptron leaming mle and, 
64-66 

unipolar, 33-34 
Binary adder network, 287-94 
Binary pattems, restoration of, 13 
Binary perceptrons, 35, 36 
Binaiy weights, in associative 
memory implementation, 
645-48 


Biological neurons, 26-36 
firing of, 29 

McCulloch-Pitts model of, 30- 
31 

modeling for artificial neural 
systems, 31-36 
structure of, 27-28 
threshold of, 29 
time units for modeling of, 29 
Biomedical signal processing, 
classifiers and, 6-7 
Bipolar activation functions 
biological neuron model, 33, 
34, 36 

feedforward network, 39 
Hebbian leaming mle and, 63 
perceptron leaming mle and, 

64 

Bipolar heteroassociative mem- 
ory, 367 

Bipolar threshold fimction, bidi- 
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and, 355 

Bistable flip-flop, 330-36 
Blackman, D., 653, 658, 660 
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function 
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Braun, L., 487 
Brobst, R. W., 528, 531 
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Campagna, D. S., 504, 508, 509, 
510, 512 
Cao, W., 217 

Capacitance, in electrical model 
of Hopfield network, 266 
Capacitance plant, transfer func- 
tion for, 486, 487 
Capacitive Circuit element, stor- 
age of charge across, 592-95 
Capacitor, natural decay of 
charge, 594 
Card, H. C., 605, 606 
Cardio, R., 220 
Carpenter, G., 20, 432 
Cart-pendulum controller, 
499-503 

Cartesian space partitioning, 
three-dimensional, 4-5 
Cascode current mirror, 612, 613 
Categories, and counterparts in 
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Category membership, classifiers 
and, 4-7. See also Classifi- 
cation; Classifiers 
Central processing unit (CPU), 
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Cerebellar model articulation 
controller (CMAC), 489, 
504-11, 512 
Chams, A., 572, 665 
Chandra, V., 464 
Channel length modulation fac- 
tor, current mirrors and, 
610-11, 612 
Channel resistance, 588 
Chapman, 418 

Character recognition networks, 
221-25, 464-85 
based on error back-propaga- 
tion training, 482-85 
based on handwritten character 
skeletonization, 478-82 
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221-25, 464-76 
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its in, 476-78 

Characteristic surface, 504-6, 

507, 508 
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able, 569-70 
Chen, H. H., 377 
Chevroulet, M., 592, 594 
Chin, B. L., 653, 660, 661, 663 
Chin, C., 299 

Chip circuitry. See also Inte- 
grated circuits 
associative memory, 645, 
647-51 

ETANN 80170, 653-57 
leaming and, 664 
MDAC, 600-603, 604 
Chip-in-the-loop leaming, 653 
Chiu, T., 653, 658, 660 
Choi, M. R., 665 
Chua, L. O., 271, 299 
Chung, H. W., 628 
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associative memory, 644-52. 
See also Associative memo- 
ries 

capacitive, 592-95 
electronic, 575-85 
integrated, 585-608, 609. See 
also Integrated circuits 
MDAC, 595-604 
sample-and-hold, 593-95 
scalar product. See Scalar 
product circuits 
Clare, T., 653, 658, 660 
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Classical conditioning, 76, 77 
Classification, 53-54, 94-95. See 
also Categories; Clustering 
correlation, 109 
discrete perceptron in, 120-32 
geometric description of, 97-98 
linearly nonseparable, 165-75 
minimum distance, 108-14 
multilayer feedforward net- 
works in, 220-25 
unsupervised, 58 
Classification error, minimization 
of, 133-43 


Classification response, 54 

Classifiers, 4-7, 391-98. See 
also Classification; specific 
type 

Closed-loop feedback system, in 
target position leaming, 

524 

Cluster discovery network, 
432-44 

Clustering, similarity measures 
and, 399-401 

Clustering networks, 16, 76, 
399-410 

design of neural network for, 
581-85 

initialization of weights in, 
406-7 

Kohonen, 401-5 
recall mode and, 406 
separability limitations of, 
409-10 

Clusters, unsupervised leaming 
of, 58, 399-410 

CMAC (cerebellar model ar- 
ticulation controller), 489, 
504-11, 512 

CMOS current comparator, 
626-28 

CMOS inverter circuit, 613-17, 
618 

CMOS technology, 608, 609 
analog multiplier with weight 
storage and, 640 
associative memory and, 645, 
650, 651 

Cocontent function, 271 

Coding. See Encoding and de- 
coding 

Collins, D. R., 664, 665 

Combinational optimization 
tasks, 16, 294-99 

“Committee” networks, 173. See 
also Multilayer feedforward 
networks 

Communication traffic, routing 
of, 298 

Competitive leaming. See also 
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leaky, 405 


winner-take-all leaming rule 
and, 70-71 
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analog, 575-80 
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work, 77 
mixed, 567 

by neural networks, 2. See also 
Neural computation 
Computational bandwidth, 571, 
572 

Computational energy function. 

See Energy function 
Computers. See Hardware 
Computing nodes, 569 
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in electrical model of Hopfield 
network, 266 
photosynapse and, 661 
Conductance matrix, 289 
Connection matrix. See Weight 
matrix 

Connection strength, 2 
Connectionist expert systems, 
226-29, 527-39 
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sis, 537-39 
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532-36 

for skin disease diagnosis, 
528-32 
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571, 572-73 

Constrained freedom pattem data, 
415-16 
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545-46 

Continuous activation functions 
biological neuron model, 

33-34 

bipolar, 33, 34 
delta leaming rule and, 

66-69 

feedforward network, 41, 42 
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Continuous function approxima- 
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Continuous function approxima- 
tion error, 203 
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Continuous-time feedback net- 
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electric, 49-52 
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within, 48-49 
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287-99 

transient response of, 276-83 
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487-88 
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dynamic back-propagation and, 
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Intelledex 605T and, 524 
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94 

Conventional computers, parallel 
distributed processing hard- 
ware vs., 568-69 
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expert system for, 537-39 
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120-23 

Correction rules, 122-23 
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Correlation leaming rule, 69-70, 
73 

Correlation term, Hebbian leam- 
ing rule, 61 
Cosi, P., 220 

Cost function, linear program- 
ming and, 457, 463 
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trajectory mapping with, 
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CPS units, 571, 572-73 
CPU (central processing unit), 
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Cross-talk noise, 323 
Cruz, J. B., 365, 366, 367, 368 
Cumulative error, 195-96 
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Current mirrors, 610-13 

differential voltage amplifiers 
and, 621-22, 623 
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tive memory design, 648, 
649 
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weighted-resistor MDAC, 596 
Cybenko, G., 220, 222 
Cybemetics, 19 

D/A conversion. See Digital-to- 
analog conversion 
DARPA, 571 

Data, dimensionality reduction 
of, 16-17 

Data clusters. See Cluster dis- 
covery network; Clustering 
networks; Clusters 
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chitectures vs., 214-16 
Data transfer, speed of, 572 
Davenport, M. R., 377 
Dayhoff, J. E., 28, 464 
Dayhoff, R. E., 464 
Decision error, 196 
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linearly nonseparable pattem 
classification, 165-66 
single-layer perceptron classi- 
fier, 116-17 
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Decision surfaces, 98, 101 
discrete perceptron classifier 
training, 125, 126 
discriminant functions and, 
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Decoding. See Encoding and de- 
coding 

Dedicated VLSI chips, 658, 660 
Delay network, 47-49. See also 
Time-delay network 
Delta, error signal term, 177, 178 
Delta function, Kronecker, 297, 
328 

Delta leaming rule, 66-69, 73, 

77 

continuous perceptron and, 
137-42 

generalized, 181-85 
multiperceptron layer and, 
175-81 

Widrow-Hoff leaming rule as 
special case of, 69 
Delta training /leaming weight 
adjustments, 178 
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system for, 528-32 
Desai, M. S., 345, 464, 466, 470 
DESKNET expert system, 

528-32 

Deweerth, S. P., 665 
Dichotomization, triple, 150 
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sifiers 

linear discriminant function 
and, 107-8 

TLU-based perceptron, 115 
weight adjustments of, 117-20 
Dichotomizer training 

absolute correction rule and, 
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126-27 

Differential current 
ETANN 80170 chip, 655, 656, 
657 

floating-gate transistor multi- 
pliers, 642-43 
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617-24 
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work with, 287-94 
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analog processor and, 658 
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Dijkstra, E., 592, 594 
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ture detection and, 16-17, 


415-16 
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ble classification, 170, 172 
R-category classifier training, 

144- 46 
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sociative memory, 358. See 
also Bidirectional associa- 
tive memory 

Discrete-time feedback networks, 
42, 43, 44-49, 252, 254-64 
Discrete-time Hopfield networks, 
mathematical foundations of, 
254-64 

Discriminant functions, 99-106 
contour map of, 102 
linear, 107, 108, 111-13 
multicategory classifiers and, 
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Disease diagnosis. See Connec- 
tionist expert systems 
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efficiency of, 567-69 
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leaming control årchitecture 
and, 495 

transfer functions of, 486-87 
Dynamical systems, 251-310. 
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EBPTA (error back-propagation 
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Electroencephalogram (EEG), 76 
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flip-flop, 330-36 
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47- 49 
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feedback network, 

48- 49 
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ing with, 506-7 
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Encoding and decoding 
bidirectional associative mem- 
ory, 357-59, 363-68 
multiple training, 365-68 
Energy function 
discrete-time Hopfield net- 
work, 258-60 
flip-flop network, 332, 333 
in four-bit A/D conversion 
problem, 288-89 
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work, 268-76 
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ory, 327-28 
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ory, 327-28, 340, 342-43 
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time derivative of, 461 
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278-83 
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work, 269-70 
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457-64 
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ory and, 342-43 
Engineering, optimization in, 14 
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636-38 
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196 

training, 195-96 
Error back-propagation training, 
175, 186-95 
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25, 466-76, 482-85 
dynamic, 513 
example of, 190-95 
expert systems using, 226-29 
forward kinematics problem 
and, 520-21, 522 
leaming control systems and, 
497-98 

with output splitting, 520-21, 
522 

photosynapses and, 662, 663 
target position leaming and, 
523-27 
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Error back-propagation training 
algorithm (EBPTA), 187, 
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convergence of, 207-13 
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tion of, 205, 207 
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using, 133-42 
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error back-propagation and, 
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ETANN 80170 chip, 653-57 
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connectionist, 226-29, 527- 
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works and, 225-29 
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discrete-time, 42, 43, 44-49 
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42-43 

leaming in, 56 
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Feedforward control. See also 
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38 
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error back-propagation training 
and, 186-95 
expert, 225-29, 537-39 
feedforward recall and, 185— 

86 

in forward and inverse kine- 
matics problems, 516-19 
functional link, 230-33, 234, 
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differential. See Differential 
current 

Output error, 497-98 
Output neurons, number of, 
215-16 

Output pattem. See Output vec- 
tors 

Output signal, 32 
Output splitting, error back- 
propagation training with, 
520-21, 522 

Output update rule, for Hopfield 
autoassociative memory, 327 
Output vectors 

biological neuron model, 32, 36 
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fiers, 624-25 

Scheduling, job-shop, 16, 298 
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Weight(s) 

ETANN 80170 chip, 653-55 
feedback network, 48-49 
integrated Circuit, 585-608, 
609 

neuron models with, 35 
photosynaptic, 660-62, 663 
Synchronous State updating algo- 
rithm, 261-62 
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